From owner-freebsd-arch@FreeBSD.ORG  Sun Aug 17 01:26:58 2014
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTPS id DEE43BE6;
 Sun, 17 Aug 2014 01:26:57 +0000 (UTC)
Received: from mail-wg0-x22b.google.com (mail-wg0-x22b.google.com
 [IPv6:2a00:1450:400c:c00::22b])
 (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits))
 (Client CN "smtp.gmail.com",
 Issuer "Google Internet Authority G2" (verified OK))
 by mx1.freebsd.org (Postfix) with ESMTPS id 4917A23D8;
 Sun, 17 Aug 2014 01:26:57 +0000 (UTC)
Received: by mail-wg0-f43.google.com with SMTP id l18so3612006wgh.14
 for <multiple recipients>; Sat, 16 Aug 2014 18:26:55 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
 h=date:from:to:cc:subject:message-id:references:mime-version
 :content-type:content-disposition:content-transfer-encoding
 :in-reply-to:user-agent;
 bh=32AQpC+aVQbFOxyzwAsARKVHWmDo+UsYIP8IUKUFVX8=;
 b=bdhCwnryhRY/P5L1y/SFV53kx5K7U2XHkeEL/vtHaxFrFXMYSH6qFtx2EMYt6TKM+Z
 cv3+76otQKRB3BHB+kgdowzjVgGdTzCWtPOwz1bVWeP7N4iN3jbbl5lrkb/j+9D8jIb+
 QHUcQy9DsNkpxhhnS5snoSqEqysgOh4vUYdfwqPU0lwXo9EvhzrTlpUfBH/JFCpcYDM3
 svuCrBWHnx18jT7BSU4jmFK936V4Rj32QKLjwAQ6L3SmlZbPDq/ds2cHffqvipbnSrYE
 6EdvQ0vF+7ayYecSrb5RraZafYa/oRjrfWcp9K+Miurwsew9ejE2ypeZvL+bTZu0WDU2
 tXGw==
X-Received: by 10.180.89.100 with SMTP id bn4mr18059090wib.34.1408238815528;
 Sat, 16 Aug 2014 18:26:55 -0700 (PDT)
Received: from dft-labs.eu (n1x0n-1-pt.tunnel.tserv5.lon1.ipv6.he.net.
 [2001:470:1f08:1f7::2])
 by mx.google.com with ESMTPSA id w1sm22141460wiz.14.2014.08.16.18.26.54
 for <multiple recipients>
 (version=TLSv1.2 cipher=RC4-SHA bits=128/128);
 Sat, 16 Aug 2014 18:26:54 -0700 (PDT)
Date: Sun, 17 Aug 2014 03:26:47 +0200
From: Mateusz Guzik <mjguzik@gmail.com>
To: Konstantin Belousov <kostikbel@gmail.com>
Subject: Re: [PATCH 1/2] Implement simple sequence counters with memory
 barriers.
Message-ID: <20140817012646.GA21025@dft-labs.eu>
References: <1408064112-573-1-git-send-email-mjguzik@gmail.com>
 <1408064112-573-2-git-send-email-mjguzik@gmail.com>
 <20140816093811.GX2737@kib.kiev.ua>
 <20140816185406.GD2737@kib.kiev.ua>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <20140816185406.GD2737@kib.kiev.ua>
User-Agent: Mutt/1.5.21 (2010-09-15)
Cc: Johan Schuijt <johan@transip.nl>, freebsd-arch@freebsd.org
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.18-1
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch/>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 17 Aug 2014 01:26:58 -0000

On Sat, Aug 16, 2014 at 09:54:06PM +0300, Konstantin Belousov wrote:
> On Sat, Aug 16, 2014 at 12:38:11PM +0300, Konstantin Belousov wrote:
> > On Fri, Aug 15, 2014 at 02:55:11AM +0200, Mateusz Guzik wrote:
> > > ---
> > >  sys/sys/seq.h | 126 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > >  1 file changed, 126 insertions(+)
> > >  create mode 100644 sys/sys/seq.h
> > > 
> > > diff --git a/sys/sys/seq.h b/sys/sys/seq.h
> > > new file mode 100644
> > > index 0000000..0971aef
> > > --- /dev/null
> > > +++ b/sys/sys/seq.h
[..]
> > > +#ifndef _SYS_SEQ_H_
> > > +#define _SYS_SEQ_H_
> > > +
> > > +#ifdef _KERNEL
> > > +
> > > +/*
> > > + * Typical usage:
> > > + *
> > > + * writers:
> > > + * 	lock_exclusive(&obj->lock);
> > > + * 	seq_write_begin(&obj->seq);
> > > + * 	.....
> > > + * 	seq_write_end(&obj->seq);
> > > + * 	unlock_exclusive(&obj->unlock);
> > > + *
> > > + * readers:
> > > + * 	obj_t lobj;
> > > + * 	seq_t seq;
> > > + *
> > > + * 	for (;;) {
> > > + * 		seq = seq_read(&gobj->seq);
> > > + * 		lobj = gobj;
> > > + * 		if (seq_consistent(&gobj->seq, seq))
> > > + * 			break;
> > > + * 		cpu_spinwait();
> > > + * 	}
> > > + * 	foo(lobj);
> > > + */		
> > > +
> > > +typedef uint32_t seq_t;
> > > +
> > > +/* A hack to get MPASS macro */
> > > +#include <sys/systm.h>
> > > +#include <sys/lock.h>
> > > +
> > > +#include <machine/cpu.h>
> > > +
> > > +static __inline bool
> > > +seq_in_modify(seq_t seqp)
> > > +{
> > > +
> > > +	return (seqp & 1);
> > > +}
> > > +
> > > +static __inline void
> > > +seq_write_begin(seq_t *seqp)
> > > +{
> > > +
> > > +	MPASS(!seq_in_modify(*seqp));
> > > +	(*seqp)++;
> > > +	wmb();
> > This probably ought to be written as atomic_add_rel_int(seqp, 1);
> Alan Cox rightfully pointed out that better expression is
> v = *seqp + 1;                                                                  
> atomic_store_rel_int(seqp, v);
> which also takes care of TSO on x86.
> 

Well, my memory-barrier-and-so-on-fu is rather weak.

I had another look at the issue. At least on amd64, it looks like only
a compiler barrier is required for both reads and writes.

The AMD64 Architecture Programmer’s Manual Volume 2: System Programming,
section 7.2 "Multiprocessor Memory Access Ordering", states:

"Loads do not pass previous loads (loads are not reordered). Stores do
not pass previous stores (stores are not reordered)"

Since the code modifying stuff only performs a series of writes and we
expect exclusive writers, I find it applicable to this scenario.

I checked the Linux sources and the generated assembly; they indeed issue
only a compiler barrier on amd64 (and for Intel processors as well).

atomic_store_rel_int on amd64 seems fine in this regard, but the only
function for loads issues lock cmpxchg, which kills performance
(median 55693659 -> 12789232 ops in a microbenchmark) for no gain.

Additionally, release and acquire semantics seem to be a stronger
guarantee than needed.

As far as sequence counters go, we should be able to get away with
guaranteeing the following:
- all relevant reads are performed between given points
- all relevant writes are performed between given points

As such, I propose introducing additional atomic_* function variants
(or stealing the smp_{w,r,}mb idea from Linux) which provide just that.

So for amd64 both the reading guarantee and the writing guarantee could
be provided the same way, with a compiler barrier.
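
To make the idea above concrete, here is a rough userland sketch of a
sequence counter that relies on nothing but a compiler barrier. It is not
the patch itself: compiler_barrier() is a hypothetical helper (GCC/Clang
inline asm), the function names merely follow the quoted seq.h, and the
whole thing is only meant for strongly ordered machines such as amd64.
Other architectures would need real memory barriers.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t seq_t;

/*
 * Hypothetical helper: keeps the compiler from moving memory accesses
 * across this point. It emits no CPU instruction, so it is sufficient
 * only where the hardware already keeps loads ordered against loads and
 * stores ordered against stores (e.g. amd64).
 */
#define	compiler_barrier()	__asm__ __volatile__("" ::: "memory")

static inline bool
seq_in_modify(seq_t seq)
{

	return (seq & 1);	/* odd value: a writer is in progress */
}

static inline void
seq_write_begin(seq_t *seqp)
{

	(*seqp)++;		/* counter becomes odd */
	compiler_barrier();	/* counter store stays before data stores */
}

static inline void
seq_write_end(seq_t *seqp)
{

	compiler_barrier();	/* data stores stay before counter store */
	(*seqp)++;		/* counter becomes even again */
}

static inline seq_t
seq_read(const seq_t *seqp)
{
	seq_t seq;

	for (;;) {
		/* volatile access forces a fresh load on every iteration */
		seq = *(const volatile seq_t *)seqp;
		if (!seq_in_modify(seq))
			break;
	}
	compiler_barrier();	/* counter load stays before data loads */
	return (seq);
}

static inline bool
seq_consistent(const seq_t *seqp, seq_t oldseq)
{

	compiler_barrier();	/* data loads stay before the re-check */
	return (*(const volatile seq_t *)seqp == oldseq);
}
```

Writers are still expected to be serialized by an external lock, exactly
as in the usage comment quoted at the top; the sketch covers only the
ordering part.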

> > Same note for all other linux-style barriers.  In fact, on x86
> > wmb() is sfence and it serves no useful purpose in seq_write*.
> > 
> > Overall, it feels too alien and linux-ish for my taste.
> > Since we have sequence bound to some lock anyway, could we introduce
> > some sort of generation-aware locks variants, which extend existing
> > locks, and where lock/unlock bump generation number ?
> Still, merging it to the guts of lock implementation is right
> approach, IMO.
> 

Current usage would be alongside the filedesc (sx) lock. The lock protects
writes to the entire fd table (and lock holders can block in malloc), while
each file descriptor has its own counter. Also, the areas covered by seq are
short and cannot block.

As such, I don't really see any way to merge the lock with the counter.

I agree it would be useful, provided the area protected by the lock were
the same as the one protected by the counter. If this code hits the tree
and one day it turns out someone needs such functionality, there should not
be any problems (apart from the time and effort) in implementing this.

-- 
Mateusz Guzik <mjguzik gmail.com>

From owner-freebsd-arch@FreeBSD.ORG  Sun Aug 17 10:44:34 2014
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTPS id 57FF3513;
 Sun, 17 Aug 2014 10:44:34 +0000 (UTC)
Received: from mail.beastielabs.net (unknown
 [IPv6:2001:888:1227:0:200:24ff:fec9:5934])
 (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
 (Client did not present a certificate)
 by mx1.freebsd.org (Postfix) with ESMTPS id B0C0D25AB;
 Sun, 17 Aug 2014 10:44:33 +0000 (UTC)
Received: from beastie.hotsoft.nl (beastie.hotsoft.nl
 [IPv6:2001:888:1227:0:219:d1ff:fee8:91eb])
 by mail.beastielabs.net (8.14.7/8.14.7) with ESMTP id s7HAiUnN059437;
 Sun, 17 Aug 2014 12:44:30 +0200 (CEST)
 (envelope-from hans@beastielabs.net)
Message-ID: <53F0878E.3000401@beastielabs.net>
Date: Sun, 17 Aug 2014 12:44:30 +0200
From: Hans Ottevanger <hans@beastielabs.net>
User-Agent: Mozilla/5.0 (X11; FreeBSD amd64;
 rv:31.0) Gecko/20100101 Thunderbird/31.0
MIME-Version: 1.0
To: =?UTF-8?B?RWR3YXJkIFRvbWFzeiBOYXBpZXJhxYJh?= <trasz@FreeBSD.org>
Subject: Re: [CFT] Autofs.
References: <20140730071933.GA20122@pc5.home>
In-Reply-To: <20140730071933.GA20122@pc5.home>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Cc: freebsd-current@FreeBSD.org, freebsd-arch@FreeBSD.org
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.18-1
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch/>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 17 Aug 2014 10:44:34 -0000

On 07/30/14 09:19, Edward Tomasz Napierała wrote:
> At the link below you will find a patch that adds the new automounter.
> The patch is against yesterdays 11.0-CURRENT.
>
> http://people.freebsd.org/~trasz/autofs-head-20140729.diff
>
> Slides that explain the project scope and deliverables are here:
>
> http://people.freebsd.org/~trasz/autofs.pdf
>
> Testing is welcome.  Please start with manual pages, eg. automount(8).
> Note that you need not only to rebuild both kernel and world, but also
> to run mergemaster, to install required /etc files.  To run at startup,
> add 'autofs_enable="YES"' to /etc/rc.conf.
>
> This project is being sponsored by FreeBSD Foundation.
>

Hi!

Great to see a real autofs finally coming to FreeBSD.

I already did some very cursory testing on a recent 11-CURRENT system 
that I still happened to have and things with at least the /net map look 
quite OK.

I could do some more extensive testing if I could use some of my 
10-STABLE systems. I already checked that the patch applies cleanly to a 
recent 10-STABLE (modulo a few offsets) and that both buildworld and 
buildkernel succeed. Should I expect difficulties actually running your 
autofs on 10-STABLE?

And do you plan support for NIS? I know NIS is quite dead and has been 
so for at least 20 years, but I still see it being used occasionally 
(probably mostly out of habit) and it is (still?) available in the 
base system.

Kind regards,

Hans


From owner-freebsd-arch@FreeBSD.ORG  Sun Aug 17 13:22:58 2014
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTPS id 16F24A7D;
 Sun, 17 Aug 2014 13:22:58 +0000 (UTC)
Received: from outpost1.zedat.fu-berlin.de (outpost1.zedat.fu-berlin.de
 [130.133.4.66])
 (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
 (Client did not present a certificate)
 by mx1.freebsd.org (Postfix) with ESMTPS id A5E4327FF;
 Sun, 17 Aug 2014 13:22:57 +0000 (UTC)
Received: from inpost2.zedat.fu-berlin.de ([130.133.4.69])
 by outpost.zedat.fu-berlin.de (Exim 4.82) with esmtp
 (envelope-from <ohartman@zedat.fu-berlin.de>)
 id <1XJ0QB-000LJR-7e>; Sun, 17 Aug 2014 15:22:55 +0200
Received: from g229053128.adsl.alicedsl.de ([92.229.53.128]
 helo=thor.walstatt.dynvpn.de)
 by inpost2.zedat.fu-berlin.de (Exim 4.82) with esmtpsa
 (envelope-from <ohartman@zedat.fu-berlin.de>)
 id <1XJ0QB-002tJx-3u>; Sun, 17 Aug 2014 15:22:55 +0200
Date: Sun, 17 Aug 2014 15:22:54 +0200
From: "O. Hartmann" <ohartman@zedat.fu-berlin.de>
To: Hans Ottevanger <hans@beastielabs.net>
Subject: Re: [CFT] Autofs.
Message-ID: <20140817152254.1e2786db.ohartman@zedat.fu-berlin.de>
In-Reply-To: <53F0878E.3000401@beastielabs.net>
References: <20140730071933.GA20122@pc5.home>
 <53F0878E.3000401@beastielabs.net>
Organization: FU Berlin
X-Mailer: Claws Mail 3.10.1 (GTK+ 2.24.22; amd64-portbld-freebsd11.0)
MIME-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha1;
 boundary="Sig_//tezXkiRzb233=u3GdmP179"; protocol="application/pgp-signature"
X-Originating-IP: 92.229.53.128
X-ZEDAT-Hint: A
Cc: freebsd-current@FreeBSD.org,
 Edward Tomasz =?utf-8?Q?Napiera=C5=82a?= <trasz@FreeBSD.org>,
 freebsd-arch@FreeBSD.org
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.18-1
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch/>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 17 Aug 2014 13:22:58 -0000

--Sig_//tezXkiRzb233=u3GdmP179
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable

Am Sun, 17 Aug 2014 12:44:30 +0200
Hans Ottevanger <hans@beastielabs.net> schrieb:

> On 07/30/14 09:19, Edward Tomasz Napiera=C5=82a wrote:
> > At the link below you will find a patch that adds the new automounter.
> > The patch is against yesterdays 11.0-CURRENT.
> >
> > http://people.freebsd.org/~trasz/autofs-head-20140729.diff
> >
> > Slides that explain the project scope and deliverables are here:
> >
> > http://people.freebsd.org/~trasz/autofs.pdf
> >
> > Testing is welcome.  Please start with manual pages, eg. automount(8).
> > Note that you need not only to rebuild both kernel and world, but also
> > to run mergemaster, to install required /etc files.  To run at startup,
> > add 'autofs_enable=3D"YES"' to /etc/rc.conf.
> >
> > This project is being sponsored by FreeBSD Foundation.
> >
>
> Hi!
>
> Great to see a real autofs finally coming to FreeBSD.
>
> I already did some very cursory testing on a recent 11-CURRENT system
> that I still happened to have and things with at least the /net map look
> quite OK.
>
> I could do some more extensive testing if I could use some of my
> 10-STABLE systems. I already checked that the patch applies cleanly to a
> recent 10-STABLE (modulo a few offsets) and that both buildworld and
> buildkernel succeed. Should I expect difficulties actually running your
> autofs on 10-STABLE?
>
> And do you plan support for NIS? I know NIS is quite dead and has been
> so for at least 20 years, but I still see it being used occasionally
> (probably most out of habit) and it is (still ?) available in the
> base-system.
>
> Kind regards,
>
> Hans
>
> _______________________________________________
> freebsd-current@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-current
> To unsubscribe, send any mail to "freebsd-current-unsubscribe@freebsd.org"

Is this "new" autofs of the same type and concept as the autofs used in
Linux for more than a decade now?

--Sig_//tezXkiRzb233=u3GdmP179
Content-Type: application/pgp-signature; name=signature.asc
Content-Disposition: attachment; filename=signature.asc

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2

iQEcBAEBAgAGBQJT8KyuAAoJEOgBcD7A/5N8wUcIALO/3aHJq2q2udeRrHvvX552
0LTB1pRdaNzFYWP8obX6D0eMmpc6qkBAYQ3FjVWfDI3bBctMJQOM3949jIpBJ6ET
0UGyDsdx0wCkxDL69vf7AJ1G4ECZuckpgIzhczXMrUaz7oEPL8cSoJdtYhbARayU
Mv7/YqFvoYvBuWI80g3dLmXTxOKXTZcC9SWPeJNC/njrJOtCxn8cevz6gMBp3fLS
/uqt3jLXYbkK+cDxhE5Rm7CNdjdkJfsFbX1a/4mUXM+3yX0onMeL5fVahEtyiye/
d4RokjF2VVNgUyMt4RyRshLKI48O7JfQ57AK+IO0xM+HAg/s1vFybzSTjVVhxVU=
=cZaN
-----END PGP SIGNATURE-----

--Sig_//tezXkiRzb233=u3GdmP179--

From owner-freebsd-arch@FreeBSD.ORG  Sun Aug 17 14:51:02 2014
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTPS id 862F73ED;
 Sun, 17 Aug 2014 14:51:02 +0000 (UTC)
Received: from mail-la0-x235.google.com (mail-la0-x235.google.com
 [IPv6:2a00:1450:4010:c03::235])
 (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits))
 (Client CN "smtp.gmail.com",
 Issuer "Google Internet Authority G2" (verified OK))
 by mx1.freebsd.org (Postfix) with ESMTPS id D32C2205D;
 Sun, 17 Aug 2014 14:51:01 +0000 (UTC)
Received: by mail-la0-f53.google.com with SMTP id gl10so3778899lab.12
 for <multiple recipients>; Sun, 17 Aug 2014 07:50:59 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
 h=sender:date:from:to:cc:subject:message-id:mail-followup-to
 :references:mime-version:content-type:content-disposition
 :content-transfer-encoding:in-reply-to:user-agent;
 bh=MuyiV+EyhnLHm8Dvo+77CunOnSeiifyyhLrf7a/5Jgg=;
 b=ttR4imc4RpEzf6NY6OlPFFeVXIIGB+WEa7zyF/ctmIMEVRfj/HXOCozQdP23cjEDbj
 t6RQ3YFIybUV+BW72V2eLXrm9IlOM6FCDkRZEPZE3WMIQ3tXy5XaqwlHnxlBJNmf2j2Z
 G36qTMTyD2Hk552cReuFhbjxhtqdNyRmQVfi7NuJ61u8moaf/wYJ5/19XA9ALrb1fjwv
 FZpIwJdHhwRMfgc84PjNEddAQruJneeVHCm9VV79BNJBoRfNysPQNq3+WmEWGVEceJk6
 3EI/LfS1ClXt/ZY6XcqkySEZZBe6BaN+aICrffrF68zw3paqhbTo4rbn7rCnFluHjnUx
 jMvw==
X-Received: by 10.152.164.70 with SMTP id yo6mr23792407lab.2.1408287059767;
 Sun, 17 Aug 2014 07:50:59 -0700 (PDT)
Received: from pc5.home (abpi45.neoplus.adsl.tpnet.pl. [83.8.50.45])
 by mx.google.com with ESMTPSA id h3sm8741756lah.20.2014.08.17.07.50.58
 for <multiple recipients>
 (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128);
 Sun, 17 Aug 2014 07:50:59 -0700 (PDT)
Sender: =?UTF-8?Q?Edward_Tomasz_Napiera=C5=82a?= <etnapierala@gmail.com>
Date: Sun, 17 Aug 2014 16:50:59 +0200
From: Edward Tomasz =?utf-8?Q?Napiera=C5=82a?= <trasz@FreeBSD.org>
To: Hans Ottevanger <hans@beastielabs.net>
Subject: Re: [CFT] Autofs.
Message-ID: <20140817145059.GA5497@pc5.home>
Mail-Followup-To: Hans Ottevanger <hans@beastielabs.net>,
 freebsd-arch@FreeBSD.org, freebsd-current@FreeBSD.org
References: <20140730071933.GA20122@pc5.home>
 <53F0878E.3000401@beastielabs.net>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <53F0878E.3000401@beastielabs.net>
User-Agent: Mutt/1.5.21 (2010-09-15)
Cc: freebsd-current@FreeBSD.org, freebsd-arch@FreeBSD.org
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.18-1
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch/>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 17 Aug 2014 14:51:02 -0000

On 0817T1244, Hans Ottevanger wrote:
> On 07/30/14 09:19, Edward Tomasz Napierała wrote:
> >At the link below you will find a patch that adds the new automounter.
> >The patch is against yesterdays 11.0-CURRENT.
> >
> >http://people.freebsd.org/~trasz/autofs-head-20140729.diff
> >
> >Slides that explain the project scope and deliverables are here:
> >
> >http://people.freebsd.org/~trasz/autofs.pdf
> >
> >Testing is welcome.  Please start with manual pages, eg. automount(8).
> >Note that you need not only to rebuild both kernel and world, but also
> >to run mergemaster, to install required /etc files.  To run at startup,
> >add 'autofs_enable="YES"' to /etc/rc.conf.
> >
> >This project is being sponsored by FreeBSD Foundation.
> >
> 
> Hi!
> 
> Great to see a real autofs finally coming to FreeBSD.
> 
> I already did some very cursory testing on a recent 11-CURRENT system
> that I still happened to have and things with at least the /net map
> look quite OK.
> 
> I could do some more extensive testing if I could use some of my
> 10-STABLE systems. I already checked that the patch applies cleanly
> to a recent 10-STABLE (modulo a few offsets) and that both buildworld
> and buildkernel succeed. Should I expect difficulties actually
> running your autofs on 10-STABLE?

No, it should be fine.  The plan is to MFC this to 10 soon, btw.

> And do you plan support for NIS? I know NIS is quite dead and has
> been so for at least 20 years, but I still see it being used
> occasionally (probably most out of habit) and it is (still ?)
> available in the base-system.

It should be trivial to add, I just need someone with such a setup
(autofs maps in NIS) to test it against.


From owner-freebsd-arch@FreeBSD.ORG  Sun Aug 17 14:52:20 2014
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTPS id D917C66B;
 Sun, 17 Aug 2014 14:52:20 +0000 (UTC)
Received: from mail-lb0-x229.google.com (mail-lb0-x229.google.com
 [IPv6:2a00:1450:4010:c04::229])
 (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits))
 (Client CN "smtp.gmail.com",
 Issuer "Google Internet Authority G2" (verified OK))
 by mx1.freebsd.org (Postfix) with ESMTPS id 2097B20E6;
 Sun, 17 Aug 2014 14:52:19 +0000 (UTC)
Received: by mail-lb0-f169.google.com with SMTP id s7so3359793lbd.0
 for <multiple recipients>; Sun, 17 Aug 2014 07:52:17 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
 h=sender:date:from:to:cc:subject:message-id:mail-followup-to
 :references:mime-version:content-type:content-disposition
 :content-transfer-encoding:in-reply-to:user-agent;
 bh=KgsSmmtsQW129B+GOjTcU5ln0byYKGpxFwQg3XansU8=;
 b=ZK77mR/i3rd1yiATY76LdZ2Cm2ZOlcmSGyic1KcsWO5AAIQNKLYs+e2xk1cB3zreIi
 3ClAMLuC2nZi9AaZbLsJyakZ0vP6mxy51d9ZQqL6m2qFPCqhLV8WRW5Z5JYYtUmssKBI
 Bco0GcuMpJF6CxFnOspHlj/WffWNQxdYZohTZcngQLIt/RT0lydaCWEVbd41Pv96Iy02
 P/8z7Ov1BFs7LiA62KRx2FdqJ7hZxu936u9M/lCaioSPRYDPwSAuCcf7EjUIcrkr5MrF
 mckGsvQcGl9OTebNgUnu2dPuPKHfPmzesr1C2tDUl7yzoE77QHWgzasjVpkfYWwmbPGg
 3h8Q==
X-Received: by 10.112.52.225 with SMTP id w1mr23001264lbo.44.1408287137869;
 Sun, 17 Aug 2014 07:52:17 -0700 (PDT)
Received: from pc5.home (abpi45.neoplus.adsl.tpnet.pl. [83.8.50.45])
 by mx.google.com with ESMTPSA id yn1sm22592200lbb.25.2014.08.17.07.52.16
 for <multiple recipients>
 (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128);
 Sun, 17 Aug 2014 07:52:17 -0700 (PDT)
Sender: =?UTF-8?Q?Edward_Tomasz_Napiera=C5=82a?= <etnapierala@gmail.com>
Date: Sun, 17 Aug 2014 16:52:17 +0200
From: Edward Tomasz =?utf-8?Q?Napiera=C5=82a?= <trasz@FreeBSD.org>
To: "O. Hartmann" <ohartman@zedat.fu-berlin.de>
Subject: Re: [CFT] Autofs.
Message-ID: <20140817145217.GB5497@pc5.home>
Mail-Followup-To: "O. Hartmann" <ohartman@zedat.fu-berlin.de>,
 Hans Ottevanger <hans@beastielabs.net>, freebsd-current@FreeBSD.org,
 freebsd-arch@FreeBSD.org
References: <20140730071933.GA20122@pc5.home>
 <53F0878E.3000401@beastielabs.net>
 <20140817152254.1e2786db.ohartman@zedat.fu-berlin.de>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <20140817152254.1e2786db.ohartman@zedat.fu-berlin.de>
User-Agent: Mutt/1.5.21 (2010-09-15)
Cc: freebsd-current@FreeBSD.org, Hans Ottevanger <hans@beastielabs.net>,
 freebsd-arch@FreeBSD.org
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.18-1
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch/>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 17 Aug 2014 14:52:21 -0000

On 0817T1522, O. Hartmann wrote:
> Am Sun, 17 Aug 2014 12:44:30 +0200
> Hans Ottevanger <hans@beastielabs.net> schrieb:
> 
> > On 07/30/14 09:19, Edward Tomasz Napierała wrote:
> > > At the link below you will find a patch that adds the new automounter.
> > > The patch is against yesterdays 11.0-CURRENT.
> > >
> > > http://people.freebsd.org/~trasz/autofs-head-20140729.diff
> > >
> > > Slides that explain the project scope and deliverables are here:
> > >
> > > http://people.freebsd.org/~trasz/autofs.pdf
> > >
> > > Testing is welcome.  Please start with manual pages, eg. automount(8).
> > > Note that you need not only to rebuild both kernel and world, but also
> > > to run mergemaster, to install required /etc files.  To run at startup,
> > > add 'autofs_enable="YES"' to /etc/rc.conf.
> > >
> > > This project is being sponsored by FreeBSD Foundation.
> > >
> > 
> > Hi!
> > 
> > Great to see a real autofs finally coming to FreeBSD.
> > 
> > I already did some very cursory testing on a recent 11-CURRENT system 
> > that I still happened to have and things with at least the /net map look 
> > quite OK.
> > 
> > I could do some more extensive testing if I could use some of my 
> > 10-STABLE systems. I already checked that the patch applies cleanly to a 
> > recent 10-STABLE (modulo a few offsets) and that both buildworld and 
> > buildkernel succeed. Should I expect difficulties actually running your 
> > autofs on 10-STABLE?
> > 
> > And do you plan support for NIS? I know NIS is quite dead and has been 
> > so for at least 20 years, but I still see it being used occasionally 
> > (probably most out of habit) and it is (still ?) available in the 
> > base-system.
> > 
> > Kind regards,
> > 
> > Hans
> > 
> 
> Is this "new" autofs of the same type and concept as the autofs used in Linux for more
> than a decade now?

Yes.


From owner-freebsd-arch@FreeBSD.ORG  Mon Aug 18 06:54:25 2014
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTPS id A5078738;
 Mon, 18 Aug 2014 06:54:25 +0000 (UTC)
Received: from mailout05.t-online.de (mailout05.t-online.de [194.25.134.82])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (Client CN "mailout00.t-online.de",
 Issuer "TeleSec ServerPass DE-1" (verified OK))
 by mx1.freebsd.org (Postfix) with ESMTPS id 39AFF3276;
 Mon, 18 Aug 2014 06:54:25 +0000 (UTC)
Received: from fwd33.aul.t-online.de (fwd33.aul.t-online.de [172.20.27.144])
 by mailout05.t-online.de (Postfix) with SMTP id C59C046F5AD;
 Mon, 18 Aug 2014 08:54:16 +0200 (CEST)
Received: from [192.168.119.33]
 (XRes1UZOYhfIVQzekWzmQNycC9jB7ZSj-I3wVmnK-xvkQbjr-ke7bYQq8f8AFS0gWg@[84.154.101.219])
 by fwd33.t-online.de with (TLSv1.2:ECDHE-RSA-AES256-SHA encrypted)
 esmtp id 1XJGpY-1anVSa0; Mon, 18 Aug 2014 08:54:12 +0200
Message-ID: <53F1A311.4080707@freebsd.org>
Date: Mon, 18 Aug 2014 08:54:09 +0200
From: Stefan Esser <se@freebsd.org>
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64;
 rv:31.0) Gecko/20100101 Thunderbird/31.0
MIME-Version: 1.0
To: Phil Shafer <phil@juniper.net>, Alfred Perlstein <bright@mu.org>
Subject: Re: XML Output: libxo - provide single API to output TXT, XML, JSON
 and HTML
References: <201408151613.s7FGDMmt003567@idle.juniper.net>
In-Reply-To: <201408151613.s7FGDMmt003567@idle.juniper.net>
Content-Type: text/plain; charset=windows-1252
Content-Transfer-Encoding: 7bit
X-ID: XRes1UZOYhfIVQzekWzmQNycC9jB7ZSj-I3wVmnK-xvkQbjr-ke7bYQq8f8AFS0gWg
X-TOI-MSGID: 3b53171b-f196-43fa-9020-dc778cab534f
Cc: Marcel Moolenaar <marcel@freebsd.org>, John-Mark Gurney <jmg@funkthat.com>,
 "Simon J. Gerraty" <sjg@juniper.net>, "arch@freebsd.org" <arch@freebsd.org>,
 Poul-Henning Kamp <phk@phk.freebsd.dk>,
 Konstantin Belousov <kostikbel@gmail.com>,
 Marcel Moolenaar <marcel@xcllnt.net>
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.18-1
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch/>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 18 Aug 2014 06:54:25 -0000

Am 15.08.2014 um 18:13 schrieb Phil Shafer:
> Alfred Perlstein writes:
>> Can someone explain an actual use case here that makes sense?
> 
> In JUNOS, we support a NETCONF API, allowing NETCONF RPCs (in XML)
> to get hierarchical data back (in XML).  We use this to automate
> management of our devices.  When we parse RPCs, we construct command
> lines that are invoked.
> 
> For example the "show interfaces terse" command in the CLI is
> available as the <get-interface-information> RPC with the <terse/>
> option.  The JUNOS CLI parses either of these into the command line
> "ifinfo -b".
> 
> We currently are told which commands support XML output and which
> don't.  For those that do, we simply forward the command's output
> to the client.  For those that don't we wrap the output in an XML
> tag that means "we don't support this in XML yet, but here's the
> text" (and escape the data).

Is it possible to introduce an "xo" command which takes a command
line as an argument (in the same way as e.g. "time")?  A sample
usage could be "xo ls -l", which should invoke "ls -l" with its
output converted to XML (and "xo -json ls -l" could produce JSON
output).


This command is meant to decouple the request for XO support from
the method that checks for XO support and enables it.

If "xo" determines that "ls" cannot produce structured output,
it executes it as a sub-command and wraps the output in the way
you describe.  This may not be parseable by a following command
in a pipe, and you could add an "-f" option to "xo" that checks
for XO support and makes the command fail if it is not supported
(instead of wrapping up the result).
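
As a rough illustration of the fallback path (wrap plain text in a tag
and escape it), something along these lines might do. Everything here is
made up for the sake of the example: the function names and the
<unstructured-output> tag are hypothetical and not part of libxo or JUNOS.

```c
#include <stddef.h>
#include <string.h>

/*
 * Append src to the NUL-terminated string in dst (capacity cap bytes).
 * Returns 0 on success, -1 if the result would not fit.
 */
static int
buf_append(char *dst, size_t cap, const char *src)
{
	size_t used = strlen(dst);
	size_t n = strlen(src);

	if (n >= cap - used)
		return (-1);
	memcpy(dst + used, src, n + 1);	/* copies the terminating NUL too */
	return (0);
}

/*
 * Wrap plain command output in a (hypothetical) XML tag, escaping the
 * characters that would otherwise be taken for markup.
 * Returns 0 on success, -1 if out is too small.
 */
int
xo_wrap_text(const char *text, char *out, size_t cap)
{
	char one[2] = { '\0', '\0' };
	const char *rep;

	if (cap == 0)
		return (-1);
	out[0] = '\0';
	if (buf_append(out, cap, "<unstructured-output>") != 0)
		return (-1);
	for (; *text != '\0'; text++) {
		switch (*text) {
		case '&':
			rep = "&amp;";
			break;
		case '<':
			rep = "&lt;";
			break;
		case '>':
			rep = "&gt;";
			break;
		default:
			one[0] = *text;	/* ordinary character, copy as-is */
			rep = one;
			break;
		}
		if (buf_append(out, cap, rep) != 0)
			return (-1);
	}
	return (buf_append(out, cap, "</unstructured-output>"));
}
```

With the "-f" option, "xo" would skip this path and exit with an error
instead.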


The downside is the extra process invocation required for "xo",
but you could use any of the suggested methods to check for and
enable support of XO in programs, and you could change that method
at a later time without breaking existing scripts.


Methods discussed so far are e.g.:

- add long option as ARGV[1] (e.g. "--libxo-is-supported")

- use command name prefix ("xo-$CMD" linked to the actual $CMD)

- test for and use different standard file descriptors (XO_STDIN,
  XO_STDOUT, and XO_STDERR) if supported by the command

(I have probably forgotten a few ...)


If you go for "xo [options] cmd", any of the above mechanisms can
be used and the actual method can be changed at a later time.

And further options (e.g. to control the output format - XML vs. JSON,
for example) could also be passed by any method (e.g via an environment
variable checked by libxo).


Anyway: While the command syntax is not important, it should be
stable.  And that's what an "xo" command could provide ...

Regards, STefan

From owner-freebsd-arch@FreeBSD.ORG  Mon Aug 18 08:26:53 2014
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTPS id D173DC60;
 Mon, 18 Aug 2014 08:26:53 +0000 (UTC)
Received: from kib.kiev.ua (kib.kiev.ua [IPv6:2001:470:d5e7:1::1])
 (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
 (Client did not present a certificate)
 by mx1.freebsd.org (Postfix) with ESMTPS id 565513A46;
 Mon, 18 Aug 2014 08:26:53 +0000 (UTC)
Received: from tom.home (kib@localhost [127.0.0.1])
 by kib.kiev.ua (8.14.9/8.14.9) with ESMTP id s7I8QkJr080417
 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
 Mon, 18 Aug 2014 11:26:46 +0300 (EEST)
 (envelope-from kostikbel@gmail.com)
DKIM-Filter: OpenDKIM Filter v2.9.2 kib.kiev.ua s7I8QkJr080417
Received: (from kostik@localhost)
 by tom.home (8.14.9/8.14.9/Submit) id s7I8QkSH080416;
 Mon, 18 Aug 2014 11:26:46 +0300 (EEST)
 (envelope-from kostikbel@gmail.com)
X-Authentication-Warning: tom.home: kostik set sender to kostikbel@gmail.com
 using -f
Date: Mon, 18 Aug 2014 11:26:46 +0300
From: Konstantin Belousov <kostikbel@gmail.com>
To: Mateusz Guzik <mjguzik@gmail.com>
Subject: Re: [PATCH 1/2] Implement simple sequence counters with memory
 barriers.
Message-ID: <20140818082646.GL2737@kib.kiev.ua>
References: <1408064112-573-1-git-send-email-mjguzik@gmail.com>
 <1408064112-573-2-git-send-email-mjguzik@gmail.com>
 <20140816093811.GX2737@kib.kiev.ua>
 <20140816185406.GD2737@kib.kiev.ua>
 <20140817012646.GA21025@dft-labs.eu>
MIME-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha1;
 protocol="application/pgp-signature"; boundary="HlXFiQcSFG/a+HqU"
Content-Disposition: inline
In-Reply-To: <20140817012646.GA21025@dft-labs.eu>
User-Agent: Mutt/1.5.23 (2014-03-12)
X-Spam-Status: No, score=-2.0 required=5.0 tests=ALL_TRUSTED,BAYES_00,
 DKIM_ADSP_CUSTOM_MED,FREEMAIL_FROM,NML_ADSP_CUSTOM_MED autolearn=no
 autolearn_force=no version=3.4.0
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on tom.home
Cc: Johan Schuijt <johan@transip.nl>, freebsd-arch@freebsd.org
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.18-1
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch/>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 18 Aug 2014 08:26:53 -0000


--HlXFiQcSFG/a+HqU
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Sun, Aug 17, 2014 at 03:26:47AM +0200, Mateusz Guzik wrote:
> On Sat, Aug 16, 2014 at 09:54:06PM +0300, Konstantin Belousov wrote:
> > On Sat, Aug 16, 2014 at 12:38:11PM +0300, Konstantin Belousov wrote:
> > > On Fri, Aug 15, 2014 at 02:55:11AM +0200, Mateusz Guzik wrote:
> > > > ---
> > > >  sys/sys/seq.h | 126 ++++++++++++++++++++++++++++++++++++++++++++++=
++++++++++++
> > > >  1 file changed, 126 insertions(+)
> > > >  create mode 100644 sys/sys/seq.h
> > > >=20
> > > > diff --git a/sys/sys/seq.h b/sys/sys/seq.h
> > > > new file mode 100644
> > > > index 0000000..0971aef
> > > > --- /dev/null
> > > > +++ b/sys/sys/seq.h
> [..]
> > > > +#ifndef _SYS_SEQ_H_
> > > > +#define _SYS_SEQ_H_
> > > > +
> > > > +#ifdef _KERNEL
> > > > +
> > > > +/*
> > > > + * Typical usage:
> > > > + *
> > > > + * writers:
> > > > + * 	lock_exclusive(&obj->lock);
> > > > + * 	seq_write_begin(&obj->seq);
> > > > + * 	.....
> > > > + * 	seq_write_end(&obj->seq);
> > > > + * 	unlock_exclusive(&obj->unlock);
> > > > + *
> > > > + * readers:
> > > > + * 	obj_t lobj;
> > > > + * 	seq_t seq;
> > > > + *
> > > > + * 	for (;;) {
> > > > + * 		seq =3D seq_read(&gobj->seq);
> > > > + * 		lobj =3D gobj;
> > > > + * 		if (seq_consistent(&gobj->seq, seq))
> > > > + * 			break;
> > > > + * 		cpu_spinwait();
> > > > + * 	}
> > > > + * 	foo(lobj);
> > > > + */	=09
> > > > +
> > > > +typedef uint32_t seq_t;
> > > > +
> > > > +/* A hack to get MPASS macro */
> > > > +#include <sys/systm.h>
> > > > +#include <sys/lock.h>
> > > > +
> > > > +#include <machine/cpu.h>
> > > > +
> > > > +static __inline bool
> > > > +seq_in_modify(seq_t seqp)
> > > > +{
> > > > +
> > > > +	return (seqp & 1);
> > > > +}
> > > > +
> > > > +static __inline void
> > > > +seq_write_begin(seq_t *seqp)
> > > > +{
> > > > +
> > > > +	MPASS(!seq_in_modify(*seqp));
> > > > +	(*seqp)++;
> > > > +	wmb();
> > > This probably ought to be written as atomic_add_rel_int(seqp, 1);
> > Alan Cox rightfully pointed out that better expression is
> > v =3D *seqp + 1;                                                       =
          =20
> > atomic_store_rel_int(seqp, v);
> > which also takes care of TSO on x86.
> >=20
>=20
> Well, my memory-barrier-and-so-on-fu is rather weak.
>=20
> I had another look at the issue. At least on amd64, it looks like only
> compiler barrier is required for both reads and writes.
>=20
> According to AMD64 Architecture Programmer's Manual Volume 2: System
> Programming, 7.2 Multiprocessor Memory Access Ordering states:
>=20
> "Loads do not pass previous loads (loads are not reordered). Stores do
> not pass previous stores (stores are not reordered)"
>=20
> Since the code modifying stuff only performs a series of writes and we
> expect exclusive writers, I find it applicable to this scenario.
I agree.

>=20
> I checked linux sources and generated assembly, they indeed issue only
> a compiler barrier on amd64 (and for intel processors as well).
>=20
> atomic_store_rel_int on amd64 seems fine in this regard, but the only
> function for loads issues lock cmpxchg which kills performance
> (median 55693659 -> 12789232 ops in a microbenchmark) for no gain.
>=20
> Additionally release and acquire semantics seems to be a stronger than
> needed guarantee.
>=20
> As far as sequence counters go, we should be able to get away with
> making the following:
> - all relevant reads are performed between given points
> - all relevant writes are performed between given points
>=20
> As such, I propose introducing another set of atomic_* function variants
> (or stealing smp_{w,r,}mb idea from linux) which provide just that.
>=20
> So for amd64 reading guarantee and writing guarantee could be provided
> in the same way with a compiler barrier.
I think even this could be nicely done in the ia64 style of acq/rel.

>=20
> > > Same note for all other linux-style barriers.  In fact, on x86
> > > wmb() is sfence and it serves no useful purpose in seq_write*.
> > >=20
> > > Overall, it feels too alien and linux-ish for my taste.
> > > Since we have sequence bound to some lock anyway, could we introduce
> > > some sort of generation-aware locks variants, which extend existing
> > > locks, and where lock/unlock bump generation number ?
> > Still, merging it into the guts of the lock implementation is the
> > right approach, IMO.
> >=20
>=20
> Current usage would be along with filedesc (sx) lock. The lock protects
> writes to entire fd table (and lock holders can block in malloc), while
> each file descriptor has its own counter. Also areas covered by seq are
> short and cannot block.
>=20
> As such, I don't really see any way to merge the lock with the counter.
OK, I withdraw my proposal.

>=20
> I agree it would be useful, provided area protected by the lock would be
> the same as the one protected by the counter. If this code hits the tree
> and one day turns out someone needs such functionality, there should not
> be any problems (apart from time effort) in implementing this.
>=20
> --=20
> Mateusz Guzik <mjguzik gmail.com>

--HlXFiQcSFG/a+HqU
Content-Type: application/pgp-signature

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2

iQIcBAEBAgAGBQJT8bjGAAoJEJDCuSvBvK1BuUMP/0rf9cbKqSq8iHnGKIS2ORmZ
Kmt2SMZSEqIEqR/RaVIwvvsCldgV7j2IYHIf74OFaQ/stPWSEJd8ftsDVhylCEIE
XlMrW9W3BjsG224MMpsWXX30dm/iCfPBvKMl9ujJgEY7zpPCUgCIzu9QppJLJhxK
Tk+zLu6fqT8ups7lsQkJLGS1ZhrWTGAQLmvFlGUsTI5lq0yQjKXzgeYLadP29ntx
7q2QbIX1AN7oV/KvM4GpjSmDuUnvpU5OntCcGFtvycX791A8KIhjBIKsZxqE3Snp
Uw6ACdbOfT3i93AkFbM0kx8tSyzyozL6LTUaxPRG9A/H/7NNlivyUh+Ci5QFZtC3
i/BkRY9ty8cisq95EbJm23DtRNWxKq7GsXD/jOudv4BLIZA5T3HXEVNrjeEkFtZn
6EuD8PWh8WHkpxBqIgKy6ZxmxtDGc94ux+ECno5KOiV55hko9nLisgwdsPuAJA9U
WVU989GSpkBrxtmrorDbz7LFmyUVWQ2aY2LBTT3Noy+fukNxLDgsXrjCqrh5YEZW
AjrrmS865vIov0OE+3B7Y2qe140838dLbC00+sUAI7GHBd4/1DZL29BirMJoYAYk
V1a/lNPxhtf8iQcQIGiLI4vbYd1OjXjESiCRAkezUArY/5kRD+3ORQiyto2Uilih
fuEIwgfMENZVp8yi7rDt
=9fvJ
-----END PGP SIGNATURE-----

--HlXFiQcSFG/a+HqU--
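
For readers following the barrier discussion above, here is a userland
sketch of the same sequence-counter pattern.  It is not the patch and
not the FreeBSD atomic(9) API: it swaps in C11 <stdatomic.h> fences for
wmb()/rmb(), and struct obj, obj_write() and obj_read() are made-up
names for illustration.  Writers are assumed to be serialized by an
external lock, as in the thread.  On amd64 a compiler would be expected
to emit no fence instructions for the acquire and release fences used
here, which matches the "compiler barrier only" observation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t seq_t;

struct obj {				/* hypothetical protected object */
	_Atomic seq_t seq;		/* odd while a write is in progress */
	_Atomic uint32_t a, b;		/* payload covered by seq */
};

static inline bool
seq_in_modify(seq_t seq)
{
	return ((seq & 1) != 0);
}

static inline void
seq_write_begin(_Atomic seq_t *seqp)
{
	seq_t seq = atomic_load_explicit(seqp, memory_order_relaxed);

	assert(!seq_in_modify(seq));
	atomic_store_explicit(seqp, seq + 1, memory_order_relaxed);
	/* Order the odd counter before the payload stores that follow. */
	atomic_thread_fence(memory_order_release);
}

static inline void
seq_write_end(_Atomic seq_t *seqp)
{
	seq_t seq = atomic_load_explicit(seqp, memory_order_relaxed);

	assert(seq_in_modify(seq));
	/* Release: payload stores are visible before the even counter. */
	atomic_store_explicit(seqp, seq + 1, memory_order_release);
}

static inline seq_t
seq_read(_Atomic seq_t *seqp)
{
	seq_t seq;

	/* Spin until no write is in progress. */
	do {
		seq = atomic_load_explicit(seqp, memory_order_acquire);
	} while (seq_in_modify(seq));
	return (seq);
}

static inline bool
seq_consistent(_Atomic seq_t *seqp, seq_t oldseq)
{
	/* Order the payload loads before re-reading the counter. */
	atomic_thread_fence(memory_order_acquire);
	return (atomic_load_explicit(seqp, memory_order_relaxed) == oldseq);
}

static void
obj_write(struct obj *o, uint32_t a, uint32_t b)
{
	seq_write_begin(&o->seq);
	atomic_store_explicit(&o->a, a, memory_order_relaxed);
	atomic_store_explicit(&o->b, b, memory_order_relaxed);
	seq_write_end(&o->seq);
}

static void
obj_read(struct obj *o, uint32_t *a, uint32_t *b)
{
	seq_t seq;

	do {
		seq = seq_read(&o->seq);
		*a = atomic_load_explicit(&o->a, memory_order_relaxed);
		*b = atomic_load_explicit(&o->b, memory_order_relaxed);
	} while (!seq_consistent(&o->seq, seq));
}
```

The payload fields are atomics accessed with relaxed ordering so that a
reader racing with a writer is well defined in C11; the retry loop is
what discards a torn snapshot.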

From owner-freebsd-arch@FreeBSD.ORG  Mon Aug 18 13:11:51 2014
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTPS id 1BFDAFA4;
 Mon, 18 Aug 2014 13:11:51 +0000 (UTC)
Received: from na01-by2-obe.outbound.protection.outlook.com
 (mail-by2lp0243.outbound.protection.outlook.com [207.46.163.243])
 (using TLSv1 with cipher ECDHE-RSA-AES256-SHA (256/256 bits))
 (Client CN "mail.protection.outlook.com",
 Issuer "MSIT Machine Auth CA 2" (verified OK))
 by mx1.freebsd.org (Postfix) with ESMTPS id A9F7B3737;
 Mon, 18 Aug 2014 13:11:49 +0000 (UTC)
Received: from BY2PR05CA030.namprd05.prod.outlook.com (10.141.250.20) by
 DM2PR05MB736.namprd05.prod.outlook.com (10.141.178.25) with Microsoft SMTP
 Server (TLS) id 15.0.1010.18; Mon, 18 Aug 2014 13:11:46 +0000
Received: from BY2FFO11FD058.protection.gbl (2a01:111:f400:7c0c::107) by
 BY2PR05CA030.outlook.office365.com (2a01:111:e400:2c5f::20) with Microsoft
 SMTP Server (TLS) id 15.0.1010.18 via Frontend Transport; Mon, 18 Aug 2014
 13:11:46 +0000
Received: from P-EMF01-SAC.jnpr.net (66.129.239.15) by
 BY2FFO11FD058.mail.protection.outlook.com (10.1.15.178) with Microsoft SMTP
 Server (TLS) id 15.0.1010.11 via Frontend Transport; Mon, 18 Aug 2014
 13:11:46 +0000
Received: from magenta.juniper.net (172.17.27.123) by P-EMF01-SAC.jnpr.net
 (172.24.192.21) with Microsoft SMTP Server (TLS) id 14.3.146.0; Mon, 18 Aug
 2014 06:11:45 -0700
Received: from idle.juniper.net (idleski.juniper.net [172.25.4.26])	by
 magenta.juniper.net (8.11.3/8.11.3) with ESMTP id s7IDBbn83317;	Mon, 18 Aug
 2014 06:11:38 -0700 (PDT)	(envelope-from phil@juniper.net)
Received: from idle.juniper.net (localhost [127.0.0.1])	by idle.juniper.net
 (8.14.4/8.14.3) with ESMTP id s7IDBRtD018629; Mon, 18 Aug 2014 09:11:27 -0400
 (EDT)	(envelope-from phil@idle.juniper.net)
Message-ID: <201408181311.s7IDBRtD018629@idle.juniper.net>
To: Stefan Esser <se@freebsd.org>
Subject: Re: XML Output: libxo - provide single API to output TXT, XML,
 JSON and HTML
In-Reply-To: <53F1A311.4080707@freebsd.org>
Date: Mon, 18 Aug 2014 09:11:27 -0400
From: Phil Shafer <phil@juniper.net>
MIME-Version: 1.0
Content-Type: text/plain
X-EOPAttributedMessage: 0
X-Forefront-Antispam-Report: CIP:66.129.239.15; CTRY:US; IPV:NLI; IPV:NLI;
 EFV:NLI; SFV:NSPM;
 SFS:(6009001)(199003)(164054003)(189002)(20776003)(64706001)(47776003)(79102001)(76482001)(15202345003)(87936001)(68736004)(76506005)(92566001)(69596002)(84676001)(46102001)(53416004)(77982001)(15975445006)(92726001)(80022001)(6806004)(21056001)(86362001)(102836001)(85306004)(44976005)(4396001)(97736001)(103666002)(83322001)(19580395003)(83072002)(50466002)(54356999)(50986999)(81342001)(81542001)(99396002)(107046002)(48376002)(106466001)(31966008)(74502001)(81156004)(105596002)(110136001)(95666004)(74662001);
 DIR:OUT; SFP:; SCL:1; SRVR:DM2PR05MB736; H:P-EMF01-SAC.jnpr.net; FPR:; MLV:sfv;
 PTR:InfoDomainNonexistent; A:1; MX:1; LANG:en; 
X-Microsoft-Antispam: BCL:0;PCL:0;RULEID:;UriScan:;
X-Forefront-PRVS: 03077579FF
Received-SPF: SoftFail (protection.outlook.com: domain of transitioning
 juniper.net discourages use of 66.129.239.15 as permitted sender)
Authentication-Results: spf=softfail (sender IP is 66.129.239.15)
 smtp.mailfrom=phil@juniper.net; 
X-OriginatorOrg: juniper.net
Cc: Marcel Moolenaar <marcel@freebsd.org>, John-Mark Gurney <jmg@funkthat.com>,
 Alfred Perlstein <bright@mu.org>, "Simon J. Gerraty" <sjg@juniper.net>,
 "arch@freebsd.org" <arch@freebsd.org>, Poul-Henning Kamp <phk@phk.freebsd.dk>,
 Konstantin Belousov <kostikbel@gmail.com>, Marcel
 Moolenaar <marcel@xcllnt.net>
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.18-1
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch/>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 18 Aug 2014 13:11:51 -0000

Stefan Esser writes:
>Is it possible to introduce an "xo" command which takes a command
>line as an argument (in the same way as e.g. "time").  A sample
>usage could be "xo ls -l", which should invoke "ls -l" with its
>output converted to XML (and "xo -json ls -l" could produce JSON
>output).

I've implemented the "--libxo" option in a function called
xo_parse_args(), which is called before getopt* and processes and
removes libxo options.  See the example at:

http://juniper.github.io/libxo/libxo-manual.html
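
The "removes libxo options" part can be pictured with a small
stand-alone sketch.  This is not libxo's implementation:
strip_libxo_args() is a hypothetical name, and it assumes each libxo
option arrives as a single argument beginning with "--libxo" (the
"--libxo=xml" spelling used to exercise it is likewise an assumption).
It compacts argv in place so that a later getopt() call never sees
those arguments.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: drop every argument that starts with "--libxo"
 * (argv[0] is always kept) and return the new argc.  argv must have
 * room for the usual terminating NULL at argv[argc].
 */
static int
strip_libxo_args(int argc, char **argv)
{
	static const char prefix[] = "--libxo";
	int in, out;

	assert(argc >= 1 && argv != NULL);

	for (in = out = 1; in < argc; in++) {
		if (strncmp(argv[in], prefix, sizeof(prefix) - 1) == 0)
			continue;	/* consumed here, hidden from getopt */
		argv[out++] = argv[in];
	}
	argv[out] = NULL;		/* keep the conventional terminator */
	return (out);
}
```

A program would call this first thing in main() and then run its
normal option parsing on the shortened vector.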

FWIW, there's an "xo" command packaged with libxo that performs
similarly to the printf(1) command:

% xo --wrap top/data 'My {:pet} is {:age} years old\n' dog 2
My dog is 2 years old
% xo --xml --pretty --wrap top/data 'My {:pet} is {:age} years old\n' dog 2
<top>
  <data>
    <pet>dog</pet>
    <age>2</age>
  </data>
</top>

Thanks,
 Phil

From owner-freebsd-arch@FreeBSD.ORG  Mon Aug 18 15:03:22 2014
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTPS id E44C8D4E;
 Mon, 18 Aug 2014 15:03:22 +0000 (UTC)
Received: from mail.ipfw.ru (mail.ipfw.ru [IPv6:2a01:4f8:120:6141::2])
 (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
 (Client did not present a certificate)
 by mx1.freebsd.org (Postfix) with ESMTPS id A923831E8;
 Mon, 18 Aug 2014 15:03:22 +0000 (UTC)
Received: from [2a02:6b8:0:401:222:4dff:fe50:cd2f] (helo=ptichko.yndx.net)
 by mail.ipfw.ru with esmtpsa (TLSv1:DHE-RSA-AES128-SHA:128)
 (Exim 4.82 (FreeBSD)) (envelope-from <melifaro@FreeBSD.org>)
 id 1XJKVJ-0009pe-AL; Mon, 18 Aug 2014 14:49:33 +0400
Message-ID: <53F215A9.8010708@FreeBSD.org>
Date: Mon, 18 Aug 2014 19:03:05 +0400
From: "Alexander V. Chernikov" <melifaro@FreeBSD.org>
User-Agent: Mozilla/5.0 (X11; FreeBSD amd64;
 rv:24.0) Gecko/20100101 Thunderbird/24.6.0
MIME-Version: 1.0
To: arch@freebsd.org
Subject: superpages for UMA
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: "Andrey V. Elsukov" <ae@freebsd.org>, Gleb Smirnoff <glebius@freebsd.org>
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.18-1
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch/>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 18 Aug 2014 15:03:23 -0000

Hello list.

Currently UMA(9) uses PAGE_SIZE kegs to store items in.
This seems fine for most usage scenarios; however, there are some where
a very large number of items is required.

I've run into this problem while using ipfw tables (radix based) with
~50k records. This is what
`pmcstat -TS DTLB_LOAD_MISSES.MISS_CAUSES_A_WALK -w1` looks like:
PMC: [DTLB_LOAD_MISSES.MISS_CAUSES_A_WALK] Samples: 2359 (100.0%) , 0 unresolved

%SAMP IMAGE      FUNCTION             CALLERS
  28.7 kernel     rn_match             ipfw_lookup_table:21.7 rtalloc_fib_nolock:7.0
  25.5 ipfw.ko    ipfw_chk             ipfw_check_hook
   6.0 kernel     rn_lookup            ipfw_lookup_table

Some numbers: a table entry occupies 128 bytes, so we can store no more
than 30 records in a single page-sized keg.
50k records require more than 1500 kegs.
As far as I understand, the second-level TLB on a modern Intel CPU may
have 256 or 512 entries (for 4K pages), so using a large number of
entries results in TLB misses happening constantly.

Other examples:
Route tables (in the current implementation): struct rte occupies more
than 128 bytes, and storing a full view (> 500k routes) would result in
TLB misses happening all of the time.
Various stateful packet processing: a modern SLB/firewall can have
millions of states. Regardless of state size, PAGE_SIZE'd kegs are not
the best choice.

All of these can be addressed:
ipfw tables/ipfw dynamic state allocation code can (and will) be
rewritten to use uma+uma_zone_set_allocf (suggested by glebius),
and radix should simply be changed to a different lookup algo (as is
happening in ipfw tables).

However, we may consider adding another UMA flag to allocate
2M/1G-sized kegs per request.
(Additionally, the Intel Haswell arch has 512 entries in the STLB shared?
between 4k/2M, so it should help the former).

What do you think?
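
The numbers above can be reproduced with a few lines of C.  This is
only arithmetic, not UMA code: the function names are made up, and the
256-byte per-page overhead is an assumption picked so that a 4096-byte
page yields the 30 records quoted above.

```c
#include <assert.h>
#include <stdint.h>

/* Items that fit in one page after reserving overhead bytes. */
static uint64_t
items_per_page(uint64_t page_size, uint64_t item_size, uint64_t overhead)
{
	assert(item_size > 0 && overhead <= page_size);
	return ((page_size - overhead) / item_size);
}

/* Pages needed to hold nitems, rounded up. */
static uint64_t
pages_needed(uint64_t nitems, uint64_t per_page)
{
	assert(per_page > 0);
	return ((nitems + per_page - 1) / per_page);
}
```

With 128-byte items and the assumed overhead, items_per_page(4096,
128, 256) gives 30 and pages_needed(50000, 30) gives 1667, which is
well above a 512-entry second-level TLB.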






From owner-freebsd-arch@FreeBSD.ORG  Mon Aug 18 17:36:51 2014
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTPS id 8E3D4F56;
 Mon, 18 Aug 2014 17:36:51 +0000 (UTC)
Received: from dmz-mailsec-scanner-3.mit.edu (dmz-mailsec-scanner-3.mit.edu
 [18.9.25.14])
 (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
 (Client did not present a certificate)
 by mx1.freebsd.org (Postfix) with ESMTPS id 893F73FF9;
 Mon, 18 Aug 2014 17:36:50 +0000 (UTC)
X-AuditID: 1209190e-f79946d000007db1-65-53f239ab47e0
Received: from mailhub-auth-2.mit.edu ( [18.7.62.36])
 (using TLS with cipher AES256-SHA (256/256 bits))
 (Client did not present a certificate)
 by dmz-mailsec-scanner-3.mit.edu (Symantec Messaging Gateway) with SMTP id
 9C.D4.32177.BA932F35; Mon, 18 Aug 2014 13:36:43 -0400 (EDT)
Received: from outgoing.mit.edu (outgoing-auth-1.mit.edu [18.9.28.11])
 by mailhub-auth-2.mit.edu (8.13.8/8.9.2) with ESMTP id s7IHagTV027334;
 Mon, 18 Aug 2014 13:36:42 -0400
Received: from multics.mit.edu (system-low-sipb.mit.edu [18.187.2.37])
 (authenticated bits=56) (User authenticated as kaduk@ATHENA.MIT.EDU)
 by outgoing.mit.edu (8.13.8/8.12.4) with ESMTP id s7IHaeq1028756
 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NOT);
 Mon, 18 Aug 2014 13:36:41 -0400
Received: (from kaduk@localhost) by multics.mit.edu (8.12.9.20060308)
 id s7IHaeOC014040; Mon, 18 Aug 2014 13:36:40 -0400 (EDT)
Date: Mon, 18 Aug 2014 13:36:39 -0400 (EDT)
From: Benjamin Kaduk <kaduk@MIT.EDU>
To: Stefan Esser <se@freebsd.org>
Subject: Re: XML Output: libxo - provide single API to output TXT, XML, JSON
 and HTML
In-Reply-To: <53F1A311.4080707@freebsd.org>
Message-ID: <alpine.GSO.1.10.1408181335320.21571@multics.mit.edu>
References: <201408151613.s7FGDMmt003567@idle.juniper.net>
 <53F1A311.4080707@freebsd.org>
User-Agent: Alpine 1.10 (GSO 962 2008-03-14)
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Brightmail-Tracker: H4sIAAAAAAAAA+NgFvrPIsWRmVeSWpSXmKPExsUixG6norva8lOwwdKd4hZLZsxjtlhyZj27
 xYw7T1gcmD1mfJrP4nG96Sp7AFMUl01Kak5mWWqRvl0CV8bvJV+ZC5pZKxYs+cfSwPiPuYuR
 k0NCwETi5/QTLBC2mMSFe+vZuhi5OIQEZjNJNPYsZYFwNjJKnFlzhwnCOcQkcfFnF1RZA6PE
 9N8b2UD6WQS0JZ6du8cKYrMJqEjMfAMRFxFQlFgw6SATiM0s4Clx4eljRhBbWCBc4uasqWD1
 nEC9R5dvZwexeQUcJc6vXAdmCwlESjxs+A7WKyqgI7F6/xQWiBpBiZMzn7BAzNSSWD59G8sE
 RsFZSFKzkKQWMDKtYpRNya3SzU3MzClOTdYtTk7My0st0jXWy80s0UtNKd3ECApbTkm+HYxf
 DyodYhTgYFTi4T358WOwEGtiWXFl7iFGSQ4mJVFeZYNPwUJ8SfkplRmJxRnxRaU5qcWHGCU4
 mJVEeBNMgXK8KYmVValF+TApaQ4WJXHet9ZWwUIC6YklqdmpqQWpRTBZGQ4OJQneYAugRsGi
 1PTUirTMnBKENBMHJ8hwHqDhN8CGFxck5hZnpkPkTzHqcrQ0ve1lEmLJy89LlRLnPWQOVCQA
 UpRRmgc3B5ZuXjGKA70lzFsJso4HmKrgJr0CWsIEtGTr4o8gS0oSEVJSDYyLy4P2bZ+gHrD4
 167nC9aIaX1xfHJ3l/m0Ba92Jk0yWBI14ekO4ZA7FZO8eDsnTpecHp02u8HwZltg5g2Gx3LM
 l3+5TJUSWBP+/URZx8a1b37Ps2czz7Iw+M/ic0Os7H36uaAdtnVimQI97yYY+Wrsk7128MeF
 pk599yYPo+5I4ex7LP9CdP2VWIozEg21mIuKEwGAmSADEgMAAA==
Cc: "arch@freebsd.org" <arch@freebsd.org>, Phil Shafer <phil@juniper.net>
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.18-1
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch/>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 18 Aug 2014 17:36:51 -0000

On Mon, 18 Aug 2014, Stefan Esser wrote:

> Methods discussed so far are e.g.:
>
> - add long option as ARGV[1] (e.g. "--libxo-is-supported")
>
> - use command name prefix ("xo-$CMD" linked to the actual $CMD)
>
> - test for and use different standard file descriptors (XO_STDIN,
>   XO_STDOUT, and XO_STDERR) if supported by the command
>
> (I have probably forgotten a few ...)

It seems prudent to consider how well such mechanisms would play with
other libraries attempting to perform similar tricks with regard to
detecting functionality.  E.g., the "xo-" prefix can really only be used
by one library at a time.

-Ben

From owner-freebsd-arch@FreeBSD.ORG  Mon Aug 18 18:39:31 2014
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTPS id 37D0DDAD;
 Mon, 18 Aug 2014 18:39:31 +0000 (UTC)
Received: from kib.kiev.ua (kib.kiev.ua [IPv6:2001:470:d5e7:1::1])
 (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
 (Client did not present a certificate)
 by mx1.freebsd.org (Postfix) with ESMTPS id B0AA03698;
 Mon, 18 Aug 2014 18:39:30 +0000 (UTC)
Received: from tom.home (kib@localhost [127.0.0.1])
 by kib.kiev.ua (8.14.9/8.14.9) with ESMTP id s7IIdPeD099532
 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
 Mon, 18 Aug 2014 21:39:25 +0300 (EEST)
 (envelope-from kostikbel@gmail.com)
DKIM-Filter: OpenDKIM Filter v2.9.2 kib.kiev.ua s7IIdPeD099532
Received: (from kostik@localhost)
 by tom.home (8.14.9/8.14.9/Submit) id s7IIdP4g099531;
 Mon, 18 Aug 2014 21:39:25 +0300 (EEST)
 (envelope-from kostikbel@gmail.com)
X-Authentication-Warning: tom.home: kostik set sender to kostikbel@gmail.com
 using -f
Date: Mon, 18 Aug 2014 21:39:25 +0300
From: Konstantin Belousov <kostikbel@gmail.com>
To: "Alexander V. Chernikov" <melifaro@FreeBSD.org>
Subject: Re: superpages for UMA
Message-ID: <20140818183925.GP2737@kib.kiev.ua>
References: <53F215A9.8010708@FreeBSD.org>
MIME-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha1;
 protocol="application/pgp-signature"; boundary="y8hmAOsilT9lKboI"
Content-Disposition: inline
In-Reply-To: <53F215A9.8010708@FreeBSD.org>
User-Agent: Mutt/1.5.23 (2014-03-12)
X-Spam-Status: No, score=-2.0 required=5.0 tests=ALL_TRUSTED,BAYES_00,
 DKIM_ADSP_CUSTOM_MED,FREEMAIL_FROM,NML_ADSP_CUSTOM_MED autolearn=no
 autolearn_force=no version=3.4.0
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on tom.home
Cc: arch@freebsd.org, Gleb Smirnoff <glebius@freebsd.org>,
 "Andrey V. Elsukov" <ae@freebsd.org>
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.18-1
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch/>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 18 Aug 2014 18:39:31 -0000


--y8hmAOsilT9lKboI
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Mon, Aug 18, 2014 at 07:03:05PM +0400, Alexander V. Chernikov wrote:
> Hello list.
>=20
> Currently UMA(9) uses PAGE_SIZE kegs to store items in.
> It seems fine for most usage scenarios,  however there are some where=20
> very large number of items is required.
>=20
> I've run into this problem while using ipfw tables (radix based) with=20
> ~50k records. This is how
> `pmcstat -TS DTLB_LOAD_MISSES.MISS_CAUSES_A_WALK -w1` looks like:
> PMC: [DTLB_LOAD_MISSES.MISS_CAUSES_A_WALK] Samples: 2359 (100.0%) , 0=20
> unresolved
>=20
> %SAMP IMAGE      FUNCTION             CALLERS
>   28.7 kernel     rn_match             ipfw_lookup_table:21.7=20
> rtalloc_fib_nolock:7.0
>   25.5 ipfw.ko    ipfw_chk             ipfw_check_hook
>    6.0 kernel     rn_lookup            ipfw_lookup_table
>=20
> Some numbers: table entry occupies 128 bytes, so we may store no more=20
> than 30 records in single page-sized keg.
> 50k records require more than 1500 kegs.
> As far as I understand second-level TLB for modern Intel CPU may be 256=
=20
> or 512 entries( for 4K pages ), so using large number of entries
> results in TLB cache misses constantly happening.
>=20
> Other examples:
> Route tables (in current implementation): struct rte occupies more than=
=20
> 128 bytes and storing full-view (> 500k routes) would result in TLB=20
> misses happening all of the time.
> Various stateful packet processing: modern SLB/firewall can have=20
> millions of states. Regardless of state size PAGE_SIZE'd kegs is not the=
=20
> best choice.
>=20
> All of these can be addressed:
> ipfw tables/ipfw dynamic state allocation code can (and will) be=20
> rewritten to use uma+uma_zone_set_allocf (suggested by glebius),
> radix should simply be changed to a different lookup algo (as it is=20
> happening in ipfw tables).
>=20
> However, we may consider on adding another UMA flag to allocate=20
> 2M/1G-sized kegs per request.
> (Additionally, Intel Haswell arch has 512 entries in STLB shared?=20
> between 4k/2M so it should help the former).
>=20
> What do you think?
>=20
Zones with small object sizes use uma_small_alloc() to request a
physical page and its KVA mapping. On amd64, uma_small_alloc() allocates
a physical page and returns the direct map address for the page. The
direct map uses large pages (2MB, or 1GB if available). In this
sense, your allocations already use large pages for virtual memory
translations.

Zones are not local in the KVA, i.e. objects from the same zone are
usually far apart in the KVA.  Zones do not get dedicated submaps to
contain the zone-owned pages.

Note that the large-page TLB is usually relatively small.  E.g. on my
Nehalem machine, it only has 32 entries which can hold 2MB pages,
which results in 64MB of cached address space translations in
the best case.  You might try reducing the available memory to
see the increased locality and better DTLB hit ratio, if your load
can survive with less memory.

--y8hmAOsilT9lKboI
Content-Type: application/pgp-signature

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2

iQIcBAEBAgAGBQJT8khdAAoJEJDCuSvBvK1BhjQP/R565J1uLGZorgaLL9g8Vmkb
2+NsiNyxtRqEUkOQu5mvtuJrRFfHhshQlnyu1mya5710Y4JndIGsUKiiSSot/zSe
81833zvmOWE0MKJ7vVLH7Iw/PgOM+7obWm7QxuiLgLrOW/HJOdwZWABm0dw1zdIU
eu249sF4F4OhRzxBilV5jCb2m8iIRc90St07eBz+441p3xR+ZgVpBQAlQiODAV+j
4CpxpxQrvBWqhdCOKISnKMiOi2rIx4NUz5SdVXF3EjfvV40WWkMuwSnTc4jNMO7p
qY53ChGfcKsfx2CKwpzfrSPZ8wStk5s1hmryoCHEIffzyKRrnQ5Yy+ksOT+fFoe3
OW5GSbDKE+3pgEsPqwuuLhLciX1rZ9LWFoCesciVWqh9er5n3CT5XjllN3wFRGyb
s79uUsBBc4Yk+mowgyzwtGZTzIZTLtXkkVochHwDCRB5IhvWFWWyJ0heVN/mwaI3
3KlmN5JMsv+XXGO0WV/h8qVdIzlvXzbmZqXeuLoX7YbRvpjyckxsAG1UJGqTDNPx
nsCZwLZqpb7oJ0xXvdkbj1Gl3P35sa4YVNaPiY2T9JwdyWMQ88hz2U+D7xr4zw1E
HFFFka76CUWIKoInOW54vQOZhAayq24Sy7hUJeq01Zd+GCFHfo1Kahs0mG0jPtPU
ZBlEZoHQzvXyj49i/fiq
=K3wE
-----END PGP SIGNATURE-----

--y8hmAOsilT9lKboI--
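
The best-case figure quoted above (32 entries of 2MB giving 64MB) is
simple TLB-reach arithmetic, sketched below.  The helper names are
hypothetical, and the calculation assumes every TLB entry maps a
distinct, fully used page, which is the best case only.

```c
#include <assert.h>
#include <stdint.h>

#define	KiB	UINT64_C(1024)
#define	MiB	(1024 * KiB)

/* Address space covered when all nentries slots hold distinct pages. */
static uint64_t
tlb_reach(uint64_t nentries, uint64_t page_size)
{
	return (nentries * page_size);
}

/* Entries needed to map a working set of ws bytes, rounded up. */
static uint64_t
tlb_entries_needed(uint64_t ws, uint64_t page_size)
{
	assert(page_size > 0);
	return ((ws + page_size - 1) / page_size);
}
```

For example, tlb_reach(32, 2 * MiB) is 64 * MiB, while a working set
spread over 1GB of the direct map would need 512 such entries.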

From owner-freebsd-arch@FreeBSD.ORG  Mon Aug 18 19:45:27 2014
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTPS id 4966461B;
 Mon, 18 Aug 2014 19:45:27 +0000 (UTC)
Received: from mail-qc0-x230.google.com (mail-qc0-x230.google.com
 [IPv6:2607:f8b0:400d:c01::230])
 (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits))
 (Client CN "smtp.gmail.com",
 Issuer "Google Internet Authority G2" (verified OK))
 by mx1.freebsd.org (Postfix) with ESMTPS id CB8B73E3C;
 Mon, 18 Aug 2014 19:45:26 +0000 (UTC)
Received: by mail-qc0-f176.google.com with SMTP id m20so5351593qcx.35
 for <multiple recipients>; Mon, 18 Aug 2014 12:45:25 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
 h=mime-version:sender:in-reply-to:references:date:message-id:subject
 :from:to:cc:content-type;
 bh=JaebXXZUm9b4PmO3YqhfcQsFOSMkpoJPDSei3E/JaWo=;
 b=LV6sTVs2D9YIIjyXUmBIR1d/M3yrVo8Av+W1eRZVylLvNZrIWcndMvz5Cv9qr3+K+8
 bMprjgLqrcPLmGInynn/HO3qWd3uH8RLJAg8CM/DovlSup1kSJQ2Ac98ISmuY6ML+qR4
 BJW6pY+0zrQ3dANRZsJ6mWSC2MZnYMocApM0wgzNslgwwbUg42PG+mD5ELZeh/BQvcKE
 Yt1dk5ZxxLvneTI5zDqHPoL2SubrZD2wAujtcHUkKt5L8R/cAkh2hhbJdjaoSGm3w5+B
 AI7HSzS3+iV3DxXdMT9/vjjWc72ozHaPNcnsZjADgSP6RCqR1Y1eNBNxq62MCyctrRl2
 7W+Q==
MIME-Version: 1.0
X-Received: by 10.140.27.144 with SMTP id 16mr55305116qgx.18.1408391125698;
 Mon, 18 Aug 2014 12:45:25 -0700 (PDT)
Sender: adrian.chadd@gmail.com
Received: by 10.224.39.139 with HTTP; Mon, 18 Aug 2014 12:45:25 -0700 (PDT)
In-Reply-To: <20140818183925.GP2737@kib.kiev.ua>
References: <53F215A9.8010708@FreeBSD.org> <20140818183925.GP2737@kib.kiev.ua>
Date: Mon, 18 Aug 2014 12:45:25 -0700
X-Google-Sender-Auth: mTqPAID1-WA3Y8GAwUvn97dr36g
Message-ID: <CAJ-VmokXDtXSqCKo9wNVeqd9yQeZwgvjWqPA4tQnydx+0W_Gzg@mail.gmail.com>
Subject: Re: superpages for UMA
From: Adrian Chadd <adrian@freebsd.org>
To: Konstantin Belousov <kostikbel@gmail.com>
Content-Type: text/plain; charset=UTF-8
Cc: "freebsd-arch@freebsd.org" <arch@freebsd.org>,
 Gleb Smirnoff <glebius@freebsd.org>,
 "Alexander V. Chernikov" <melifaro@freebsd.org>,
 "Andrey V. Elsukov" <ae@freebsd.org>
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.18-1
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch/>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 18 Aug 2014 19:45:27 -0000

Hi!

I dug into this a little bit last year. I saw a lot of time spent just
walking TLBs for VM pages when doing a lot of VM page -> network
pushing.

On the Sandy Bridge boxes, the TLB only has 4 entries for 1G pages.

The high area of memory isn't 1G aligned, so we don't use 1G pages for
all the stuff that's allocated initially. That includes, among other
things, all the VM memory that you need.

The other thing that crept up was that we don't try to reserve memory
in any way - we'll just fragment stuff quickly from the pmap and
allocate where we can when we can. So there's currently no attempt to
allocate small kernel structures from the same underlying 1G page.

That'd be an interesting experiment - allocating VM entries and other
small things like rtentry and mbuf UMA entries from one or two 1GB
regions of memory. It may make better use of the 1G (or 2M) TLB
entries and keep things hot.
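
To put a number on that: TLB reach is just entry count times page size.
The following is only a rough sketch using the 4-entry figure above. The
function names are made up for illustration, and the "regions touched"
model is a simplification that ignores replacement policy and the other
TLB levels.

```python
GIB = 1 << 30

def tlb_reach(entries: int, page_size: int) -> int:
    """Bytes of address space that `entries` TLB entries of `page_size` can map."""
    return entries * page_size

def fits_in_tlb(regions_touched: int, entries: int) -> bool:
    """True if every distinct superpage region in the hot set can hold an entry at once."""
    return regions_touched <= entries

# 4 entries for 1G pages, as on the Sandy Bridge boxes mentioned above.
print(tlb_reach(4, GIB) // GIB, "GiB of reach")
print(fits_in_tlb(2, 4))    # small objects packed into two 1G regions
print(fits_in_tlb(16, 4))   # same objects scattered over sixteen 1G regions
```

Under that simple model, packing the hot kernel structures into one or
two 1G regions keeps them within reach, while scattering them does not.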



-a

From owner-freebsd-arch@FreeBSD.ORG  Mon Aug 18 19:48:47 2014
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTPS id 934CB763;
 Mon, 18 Aug 2014 19:48:47 +0000 (UTC)
Received: from mail-ie0-x230.google.com (mail-ie0-x230.google.com
 [IPv6:2607:f8b0:4001:c03::230])
 (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits))
 (Client CN "smtp.gmail.com",
 Issuer "Google Internet Authority G2" (verified OK))
 by mx1.freebsd.org (Postfix) with ESMTPS id 43BD83E65;
 Mon, 18 Aug 2014 19:48:47 +0000 (UTC)
Received: by mail-ie0-f176.google.com with SMTP id tr6so17594ieb.21
 for <multiple recipients>; Mon, 18 Aug 2014 12:48:46 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
 h=mime-version:reply-to:in-reply-to:references:date:message-id
 :subject:from:to:cc:content-type;
 bh=84IbjVCLMhvLluqidC+1Sq2vMOVqQN5PnM7dgeoS+S4=;
 b=ljddjmezHjNN2ZmvERtC6ydVvSyEQZ3uTqoVwVHnY+BV5OlDxhg4I8EzhmRbJ7/NTV
 mHK7QHoWO+v4g+KbIm+J9BXIUqSgzZYN2qqOjN2hjdFtTH8OZltrtgoVxGDgbcHnlN7V
 1hKUuHfqQ0QMJri/iVbYGdp3IJgVdr2AjpOm1BcycbfcDyjZQN6YzuaZftSrTlfv+7hO
 TjWn6w6NAcDuqJQ2/eH6b0EtUPI5bbrZKmtcWJa8R7K1vvt+nbcWE8Kszxg3lXa8BrSh
 9efkENTkZJUDVTBgJfol9zwhoN16lhT8qYPS40f2ObE6qER87MyHlADFIiD6mTiZkP+N
 bonQ==
MIME-Version: 1.0
X-Received: by 10.42.171.138 with SMTP id j10mr3073695icz.75.1408391326660;
 Mon, 18 Aug 2014 12:48:46 -0700 (PDT)
Received: by 10.43.17.196 with HTTP; Mon, 18 Aug 2014 12:48:46 -0700 (PDT)
Reply-To: alc@freebsd.org
In-Reply-To: <20140818183925.GP2737@kib.kiev.ua>
References: <53F215A9.8010708@FreeBSD.org> <20140818183925.GP2737@kib.kiev.ua>
Date: Mon, 18 Aug 2014 14:48:46 -0500
Message-ID: <CAJUyCcM7ZipmYu8OLxT2TCPjS+CSTGPRnotdKgchoNQH8s8ndA@mail.gmail.com>
Subject: Re: superpages for UMA
From: Alan Cox <alan.l.cox@gmail.com>
To: Konstantin Belousov <kostikbel@gmail.com>
Content-Type: text/plain; charset=UTF-8
X-Content-Filtered-By: Mailman/MimeDel 2.1.18-1
Cc: arch@freebsd.org, Gleb Smirnoff <glebius@freebsd.org>,
 "Alexander V. Chernikov" <melifaro@freebsd.org>,
 "Andrey V. Elsukov" <ae@freebsd.org>
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.18-1
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch/>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 18 Aug 2014 19:48:47 -0000

On Mon, Aug 18, 2014 at 1:39 PM, Konstantin Belousov <kostikbel@gmail.com>
wrote:

> On Mon, Aug 18, 2014 at 07:03:05PM +0400, Alexander V. Chernikov wrote:
> > Hello list.
> >
> > Currently UMA(9) uses PAGE_SIZE kegs to store items in.
> > It seems fine for most usage scenarios,  however there are some where
> > a very large number of items is required.
> >
> > I've run into this problem while using ipfw tables (radix based) with
> > ~50k records. This is what
> > `pmcstat -TS DTLB_LOAD_MISSES.MISS_CAUSES_A_WALK -w1` looks like:
> > PMC: [DTLB_LOAD_MISSES.MISS_CAUSES_A_WALK] Samples: 2359 (100.0%) , 0 unresolved
> >
> > %SAMP IMAGE      FUNCTION             CALLERS
> >   28.7 kernel     rn_match             ipfw_lookup_table:21.7 rtalloc_fib_nolock:7.0
> >   25.5 ipfw.ko    ipfw_chk             ipfw_check_hook
> >    6.0 kernel     rn_lookup            ipfw_lookup_table
> >
> > Some numbers: a table entry occupies 128 bytes, so we can store no more
> > than 30 records in a single page-sized keg.
> > 50k records require more than 1500 kegs.
> > As far as I understand, the second-level TLB on modern Intel CPUs has 256
> > or 512 entries (for 4K pages), so using such a large number of entries
> > results in TLB misses happening constantly.
> >
> > Other examples:
> > Route tables (in current implementation): struct rte occupies more than
> > 128 bytes and storing full-view (> 500k routes) would result in TLB
> > misses happening all of the time.
> > Various stateful packet processing: a modern SLB/firewall can have
> > millions of states. Regardless of state size, PAGE_SIZE'd kegs are not
> > the best choice.
> >
> > All of these can be addressed:
> > Ipfw tables/ipfw dynamic state allocation code can (and will) be
> > rewritten to use uma+uma_zone_set_allocf (suggested by glebius),
> > radix should simply be changed to a different lookup algo (as is
> > happening in ipfw tables).
> >
> > However, we may consider adding another UMA flag to allocate
> > 2M/1G-sized kegs per request.
> > (Additionally, Intel Haswell arch has 512 entries in STLB shared?
> > between 4k/2M so it should help the former).
> >
> > What do you think?
> >
> Zones with small object sizes use uma_small_alloc() to request physical
> page and its KVA mapping. On amd64, uma_small_alloc() allocates a
> physical page and returns the direct map address for the page. The
> direct map is built from large pages (2MB, or 1GB if available). In this
> sense, your allocations already use large pages for virtual memory
> translations.
>
> Zones are not local in the KVA, i.e. objects from the same zone are
> usually far apart in the KVA.  Zones do not get dedicated submaps to
> contain the zone-owned pages.
>
> Note that the large-page TLB is usually relatively small.  E.g. on my
> Nehalem machine, it only has 32 entries which can hold 2MB pages,
> which results in 64MB of cached address space translations in
> the best case.  You might try to reduce the available memory to
> see the increased locality and better DTLB hit ratio, if your load
> can survive with less memory.
>
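
For reference, the numbers quoted above work out roughly as follows.
This is only back-of-the-envelope arithmetic using the figures from the
quoted mails (128-byte entries, 30 per page, 50k records, 32 entries for
2MB pages). The variable names are made up for illustration.

```python
import math

ITEM_SIZE = 128          # bytes per table entry, per the quoted mail
ITEMS_PER_PAGE = 30      # entries that fit in one page-sized slab
RECORDS = 50_000

# Number of 4K pages the table is spread over.
pages_needed = math.ceil(RECORDS / ITEMS_PER_PAGE)
print(pages_needed)

# Best-case reach of 32 TLB entries holding 2MB pages, in MB.
reach_mb = 32 * 2
print(reach_mb)

# Size of the table payload alone, in MB, if it were packed contiguously.
payload_mb = RECORDS * ITEM_SIZE / (1 << 20)
print(round(payload_mb, 1))
```

So the payload itself is small compared with the 2MB-page reach. The
trouble is that the pages holding it are not close together in the
direct map.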


Newer Intel CPUs have more entries, and AMD CPUs have long (since
Barcelona) had more.  In particular, they allow 2 MB page mappings to be
cached in a larger L2 TLB.  Nowadays, the trouble is with the 1 GB pages.
A lot of CPUs still only support an 8 entry, 1 level TLB for 1 GB pages.

It might make sense to increase the largest size used by the buddy
allocator in vm_phys.c to 1 GB.  Then, the VM_FREEPOOL_DIRECT mechanism
might help.  Back in the days when Opteron TLBs had only 8 2MB entries, I
wrote the following in the commit message for r170477:

"The twist is that this allocator tries to reduce the number of TLB
misses incurred by accesses through a direct map to small, UMA-managed
objects and page table pages.  Roughly speaking, the physical pages
that are allocated for such purposes are clustered together in the
physical address space.  The performance benefits vary.  In the most
extreme case, a uniprocessor kernel running on an Opteron, I measured
an 18% reduction in system time during a buildworld."
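
To size the buddy allocator change mentioned above: a block of order n
covers 2**n base pages, so the order needed for a given block size is a
simple logarithm. This is a sketch of that arithmetic only, assuming
4 KB base pages. It is not code from vm_phys.c, and buddy_order() is a
made-up helper name.

```python
PAGE_SIZE = 4096  # base page size assumed here (amd64)

def buddy_order(block_size: int, page_size: int = PAGE_SIZE) -> int:
    """Order n such that a buddy block of 2**n base pages is exactly block_size."""
    pages, rem = divmod(block_size, page_size)
    # Reject sizes that are not a power-of-two multiple of the base page.
    if rem or pages == 0 or pages & (pages - 1):
        raise ValueError("block_size must be a power-of-two multiple of page_size")
    return pages.bit_length() - 1

print(buddy_order(2 << 20))  # order for a 2 MB block
print(buddy_order(1 << 30))  # order for a 1 GB block
```

Going from 2 MB to 1 GB blocks means nine more orders of free lists.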

From owner-freebsd-arch@FreeBSD.ORG  Mon Aug 18 19:52:13 2014
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTPS id 170E5A9A;
 Mon, 18 Aug 2014 19:52:13 +0000 (UTC)
Received: from mail-ie0-x22d.google.com (mail-ie0-x22d.google.com
 [IPv6:2607:f8b0:4001:c03::22d])
 (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits))
 (Client CN "smtp.gmail.com",
 Issuer "Google Internet Authority G2" (verified OK))
 by mx1.freebsd.org (Postfix) with ESMTPS id B02993F13;
 Mon, 18 Aug 2014 19:52:12 +0000 (UTC)
Received: by mail-ie0-f173.google.com with SMTP id tr6so21213ieb.32
 for <multiple recipients>; Mon, 18 Aug 2014 12:52:12 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
 h=mime-version:reply-to:in-reply-to:references:date:message-id
 :subject:from:to:cc:content-type;
 bh=Qe9GhZoEMzzWfGPQC8PbPUvioN19l1P/k59IRCPkbQ8=;
 b=oDRBQqNL1uly0yKgisXzKunZWnAlbWynVWY0IxFP+h5fbXKgThmkLGC8EBoTeQvG4h
 Q8ZaGYkWWUg+4wvtP5EwQkEXY/JZSruoCTCOiseYqldqGtp9vMAVyER9qeBS6PvvRLAi
 qpm5bpXzRtuzLzLhh66wQl43fyUhrRBAvSgvBKRFMxY5gW4GU59Hz57v2k9LwxMm+RR4
 zyILc/p91cxCATheQDxSsjFg9Y8DF0X2zrId0Mwl+qAVlFk9hY4EUp6UvN3Jw6GYqVRf
 DHynrairRSpAcKKa6iX91worxGstCyfWQo68jexA+Einhxl/O0ZixxpVcdpBljXBsomP
 WNGQ==
MIME-Version: 1.0
X-Received: by 10.43.70.66 with SMTP id yf2mr19814257icb.36.1408391532138;
 Mon, 18 Aug 2014 12:52:12 -0700 (PDT)
Received: by 10.43.17.196 with HTTP; Mon, 18 Aug 2014 12:52:12 -0700 (PDT)
Reply-To: alc@freebsd.org
In-Reply-To: <CAJ-VmokXDtXSqCKo9wNVeqd9yQeZwgvjWqPA4tQnydx+0W_Gzg@mail.gmail.com>
References: <53F215A9.8010708@FreeBSD.org> <20140818183925.GP2737@kib.kiev.ua>
 <CAJ-VmokXDtXSqCKo9wNVeqd9yQeZwgvjWqPA4tQnydx+0W_Gzg@mail.gmail.com>
Date: Mon, 18 Aug 2014 14:52:12 -0500
Message-ID: <CAJUyCcOSoFwHmeX=fN3deqAZPn-T3EFtCkFje5u8woDz2qoaRw@mail.gmail.com>
Subject: Re: superpages for UMA
From: Alan Cox <alan.l.cox@gmail.com>
To: Adrian Chadd <adrian@freebsd.org>
Content-Type: text/plain; charset=UTF-8
X-Content-Filtered-By: Mailman/MimeDel 2.1.18-1
Cc: Konstantin Belousov <kostikbel@gmail.com>,
 "freebsd-arch@freebsd.org" <arch@freebsd.org>,
 Gleb Smirnoff <glebius@freebsd.org>,
 "Alexander V. Chernikov" <melifaro@freebsd.org>,
 "Andrey V. Elsukov" <ae@freebsd.org>
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.18-1
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch/>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 18 Aug 2014 19:52:13 -0000

On Mon, Aug 18, 2014 at 2:45 PM, Adrian Chadd <adrian@freebsd.org> wrote:

> Hi!
>
> I dug into this a little bit last year. I saw a lot of time spent just
> walking TLBs for VM pages when doing a lot of VM page -> network
> pushing.
>
> On the Sandy Bridge boxes, the TLB only has 4 entries for 1G pages.
>
> The high area of memory isn't 1G aligned, so we don't use 1G pages for
> all the stuff that's allocated initially. That includes, among other
> things, all the VM memory that you need.
>
> The other thing that crept up was that we don't try to reserve memory
> in any way - we'll just fragment stuff quickly from the pmap and
> allocate where we can when we can. So there's currently no attempt to
> allocate small kernel structures from the same underlying 1G page.
>
>
For uma_small_alloc(), there is VM_FREEPOOL_DIRECT.  However, this is still
tuned for 2 MB pages.


> That'd be an interesting experiment - allocating VM entries and other
> small things like rtentry and mbuf UMA entries from one or two 1GB
> regions of memory. It may make better use of the 1G (or 2M) TLB
> entries and keep things hot.
>
>
>
> -a

From owner-freebsd-arch@FreeBSD.ORG  Mon Aug 18 20:13:33 2014
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTPS id 2258A1AA;
 Mon, 18 Aug 2014 20:13:33 +0000 (UTC)
Received: from alto.onthenet.com.au (alto.OntheNet.com.au [203.13.68.12])
 by mx1.freebsd.org (Postfix) with ESMTP id D6385313F;
 Mon, 18 Aug 2014 20:13:32 +0000 (UTC)
Received: from dommail.onthenet.com.au (dommail.OntheNet.com.au [203.13.70.57])
 by alto.onthenet.com.au (Postfix) with ESMTPS id 875C11245D;
 Tue, 19 Aug 2014 06:13:24 +1000 (EST)
Received: from Peter-Grehans-MacBook-Pro-2.local ([64.245.0.210])
 by dommail.onthenet.com.au (MOS 4.4.4-GA)
 with ESMTP id BXU20151 (AUTH peterg@ptree32.com.au);
 Tue, 19 Aug 2014 06:13:23 +1000
Message-ID: <53F25E60.5050109@freebsd.org>
Date: Mon, 18 Aug 2014 13:13:20 -0700
From: Peter Grehan <grehan@freebsd.org>
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6;
 rv:24.0) Gecko/20100101 Thunderbird/24.6.0
MIME-Version: 1.0
To: alc@freebsd.org, Konstantin Belousov <kostikbel@gmail.com>
Subject: Re: superpages for UMA
References: <53F215A9.8010708@FreeBSD.org> <20140818183925.GP2737@kib.kiev.ua>
 <CAJUyCcM7ZipmYu8OLxT2TCPjS+CSTGPRnotdKgchoNQH8s8ndA@mail.gmail.com>
In-Reply-To: <CAJUyCcM7ZipmYu8OLxT2TCPjS+CSTGPRnotdKgchoNQH8s8ndA@mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: arch@freebsd.org, Gleb Smirnoff <glebius@freebsd.org>,
 "Alexander V. Chernikov" <melifaro@freebsd.org>,
 "Andrey V. Elsukov" <ae@freebsd.org>
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.18-1
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch/>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 18 Aug 2014 20:13:33 -0000

> Newer Intel CPUs have more entries, and AMD CPUs have long (since
> Barcelona) had more.  In particular, they allow 2 MB page mappings to be
> cached in a larger L2 TLB.  Nowadays, the trouble is with the 1 GB pages.
> A lot of CPUs still only support an 8 entry, 1 level TLB for 1 GB pages.

  There are new(ish) ones effectively without 1GB pages. From the 
"Software Optimization Guide for AMD Family 16h Processors"

"Smashing"
   ...
"when the Family 16h processor encounters a 1-Gbyte page size, it will 
smash translations of that 1-Gbyte region into 2-Mbyte TLB entries, each
of which translates a 2-Mbyte region of the 1-Gbyte page."

later,

Peter.

From owner-freebsd-arch@FreeBSD.ORG  Mon Aug 18 20:26:31 2014
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTPS id 217179AF
 for <arch@freebsd.org>; Mon, 18 Aug 2014 20:26:31 +0000 (UTC)
Received: from mail-pd0-f170.google.com (mail-pd0-f170.google.com
 [209.85.192.170])
 (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits))
 (Client CN "smtp.gmail.com",
 Issuer "Google Internet Authority G2" (verified OK))
 by mx1.freebsd.org (Postfix) with ESMTPS id DEEA533B7
 for <arch@freebsd.org>; Mon, 18 Aug 2014 20:26:30 +0000 (UTC)
Received: by mail-pd0-f170.google.com with SMTP id g10so8306491pdj.1
 for <arch@freebsd.org>; Mon, 18 Aug 2014 13:26:24 -0700 (PDT)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20130820;
 h=x-gm-message-state:sender:content-type:mime-version:subject:from
 :in-reply-to:date:cc:message-id:references:to;
 bh=qUaQfSoyxwXZHjgbkO8snPN3xfU+oyxS+aaJGM92k80=;
 b=i6cfsNql7eahq12qhbfzM3sYY6jeBUUscyn3McEbGMYvUERc0Dc2fsRCHEsIRfbfMM
 1FLtN2O3C1mDB1TiwzEH4dnL8vrkaZ4i8LEdfNSyRmkEot/P68v2iIL3Rp86jSRiT2g6
 YlXPbRX5W2wAcvSubFq58zq3+hKDXpnCHcWd1dLPFTS+5OSZvOzB17H90gP8gWAMcbbw
 kdR7Wf9xAd07QCDJHwcZkxo9Ld+GT4qjfRp6x9H7y7exxiXWX5y5i7nki2R3C2fa8D9m
 1fgWaUuv/LSMFbPht7r2twkUZntFLZ20WCjTKxGTnFr+nRY3lT+g4hN4AVBZ3c5pR1oQ
 yEqA==
X-Gm-Message-State: ALoCoQnTZBly1//yBroe/1oKFc8j1kDiL4TG1c+V6eDM+ItyiegfPO5c4M0+dXedoLD7gVrsbLrK
X-Received: by 10.70.44.70 with SMTP id c6mr36094617pdm.75.1408393584700;
 Mon, 18 Aug 2014 13:26:24 -0700 (PDT)
Received: from lgmac-jku.corp.netflix.com (dc1-prod.netflix.com.
 [69.53.236.251])
 by mx.google.com with ESMTPSA id hk7sm26086705pdb.4.2014.08.18.13.26.23
 for <multiple recipients>
 (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128);
 Mon, 18 Aug 2014 13:26:23 -0700 (PDT)
Sender: Warner Losh <wlosh@bsdimp.com>
Content-Type: multipart/signed;
 boundary="Apple-Mail=_4C658C27-60E7-4EE6-BC54-329709FB8759";
 protocol="application/pgp-signature"; micalg=pgp-sha512
Mime-Version: 1.0 (Mac OS X Mail 7.3 \(1878.6\))
Subject: Re: superpages for UMA
From: Warner Losh <imp@bsdimp.com>
In-Reply-To: <53F25E60.5050109@freebsd.org>
Date: Mon, 18 Aug 2014 14:26:21 -0600
Message-Id: <257A0976-7C5E-4029-AF32-BFB3A6C60832@bsdimp.com>
References: <53F215A9.8010708@FreeBSD.org> <20140818183925.GP2737@kib.kiev.ua>
 <CAJUyCcM7ZipmYu8OLxT2TCPjS+CSTGPRnotdKgchoNQH8s8ndA@mail.gmail.com>
 <53F25E60.5050109@freebsd.org>
To: Peter Grehan <grehan@freebsd.org>
X-Mailer: Apple Mail (2.1878.6)
Cc: arch@freebsd.org
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.18-1
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch/>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 18 Aug 2014 20:26:31 -0000


--Apple-Mail=_4C658C27-60E7-4EE6-BC54-329709FB8759
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain;
	charset=windows-1252


On Aug 18, 2014, at 2:13 PM, Peter Grehan <grehan@freebsd.org> wrote:

>> Newer Intel CPUs have more entries, and AMD CPUs have long (since
>> Barcelona) had more.  In particular, they allow 2 MB page mappings to be
>> cached in a larger L2 TLB.  Nowadays, the trouble is with the 1 GB pages.
>> A lot of CPUs still only support an 8 entry, 1 level TLB for 1 GB pages.
>
> There are new(ish) ones effectively without 1GB pages. From the
> "Software Optimization Guide for AMD Family 16h Processors"
>
> "Smashing"
>  ...
> "when the Family 16h processor encounters a 1-Gbyte page size, it will
> smash translations of that 1-Gbyte region into 2-Mbyte TLB entries, each
> of which translates a 2-Mbyte region of the 1-Gbyte page."

"we'll emulate this feature designed to make things go faster in
hardware in software by doing the very thing that makes it go slow in
hardware."

Fun times. Performance Smashing!

Warner


--Apple-Mail=_4C658C27-60E7-4EE6-BC54-329709FB8759
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment;
	filename=signature.asc
Content-Type: application/pgp-signature;
	name=signature.asc
Content-Description: Message signed with OpenPGP using GPGMail

-----BEGIN PGP SIGNATURE-----
Comment: GPGTools - https://gpgtools.org

iQIcBAEBCgAGBQJT8mFtAAoJEGwc0Sh9sBEAILYQAJvi/5avR/rBR2VivBhiWVIG
3HjtyIPbTu2XE9OiyF+h4BkREZ9Wu1dyUgKnCKqYM4DPkTGdSAcRGCdSa8GqDYva
xV0QU2JH2DpjXZgmlO5JKYVzDmn/7GJVd5Ix71jg5yneg8kKl4U14ZxXcboLAY36
8t020p6vzIKNkz352kXYqLR/aCle3opbzmXTtq3lMqZHc3UMptq+XIG8m91SlQWc
24CSuJOV1W1rvi0RJ2iFR3KYE9cxvA7iUTd8RsqV5aevc22DZsjBLYRuwaA5Z2uy
xFVflbrv3bA2vxw1GdtJ/W3LiD1oH+GP0jTGHMMG/jmJTlL6JbnhHR3MT0l3Ue57
dsrI24GV0aarjjHx282cyn77RTsrR0N6Kn0mw1usRWYixY/k5JNqbdQoIXB2Fqyx
Mt4Axj3jm9kIjRCJNVx5XCix7md2SU402ac8zXdreD42IvyyXfc6cgWXvd8WNXXK
XdEyvRbQs50ktb5eXBpm9yqsRcOl6d0C0tyP7SaDCevmTn6+405Z6QytK3L9Pc+Y
yWC5hFaBLw/26JFhjF2E7ysfnfH3Nn+jIS5CgmuPzzp+qXYfmRmf5HQyJ01fr0lh
b+tSS4sJV1WOC+tEt/2Joiw3llJYiSO07x4hT/GatVZtk1e4RlRER/AX0suVyF4F
Ry22w2qx+U8yfO094Ef1
=h2+C
-----END PGP SIGNATURE-----

--Apple-Mail=_4C658C27-60E7-4EE6-BC54-329709FB8759--

From owner-freebsd-arch@FreeBSD.ORG  Mon Aug 18 20:39:23 2014
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTPS id 3AF41CB1;
 Mon, 18 Aug 2014 20:39:23 +0000 (UTC)
Received: from pp2.rice.edu (proofpoint2.mail.rice.edu [128.42.201.101])
 (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
 (Client did not present a certificate)
 by mx1.freebsd.org (Postfix) with ESMTPS id F250E34A7;
 Mon, 18 Aug 2014 20:39:22 +0000 (UTC)
Received: from pps.filterd (pp2.rice.edu [127.0.0.1])
 by pp2.rice.edu (8.14.5/8.14.5) with SMTP id s7IKacaH028827;
 Mon, 18 Aug 2014 15:39:21 -0500
Received: from mh1.mail.rice.edu (mh1.mail.rice.edu [128.42.201.20])
 by pp2.rice.edu with ESMTP id 1numser5te-1;
 Mon, 18 Aug 2014 15:39:21 -0500
X-Virus-Scanned: by amavis-2.7.0 at mh1.mail.rice.edu, auth channel
Received: from 108-254-203-201.lightspeed.hstntx.sbcglobal.net
 (108-254-203-201.lightspeed.hstntx.sbcglobal.net [108.254.203.201])
 (using TLSv1 with cipher RC4-MD5 (128/128 bits))
 (No client certificate requested) (Authenticated sender: alc)
 by mh1.mail.rice.edu (Postfix) with ESMTPSA id AA3134601D5;
 Mon, 18 Aug 2014 15:39:20 -0500 (CDT)
Message-ID: <53F26477.8050004@rice.edu>
Date: Mon, 18 Aug 2014 15:39:20 -0500
From: Alan Cox <alc@rice.edu>
User-Agent: Mozilla/5.0 (X11; FreeBSD i386;
 rv:24.0) Gecko/20100101 Thunderbird/24.2.0
MIME-Version: 1.0
To: Peter Grehan <grehan@freebsd.org>, alc@freebsd.org,
 Konstantin Belousov <kostikbel@gmail.com>
Subject: Re: superpages for UMA
References: <53F215A9.8010708@FreeBSD.org> <20140818183925.GP2737@kib.kiev.ua>
 <CAJUyCcM7ZipmYu8OLxT2TCPjS+CSTGPRnotdKgchoNQH8s8ndA@mail.gmail.com>
 <53F25E60.5050109@freebsd.org>
In-Reply-To: <53F25E60.5050109@freebsd.org>
X-Enigmail-Version: 1.6
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
X-Proofpoint-Spam-Details: rule=notspam policy=default score=0
 kscore.is_bulkscore=0
 kscore.compositescore=0 circleOfTrustscore=0
 compositescore=0.629899992726084 urlsuspect_oldscore=0.0298999927260837
 suspectscore=3 recipient_domain_to_sender_totalscore=0 phishscore=0
 bulkscore=0 kscore.is_spamscore=3.8904595378586e-08
 recipient_to_sender_totalscore=0
 recipient_domain_to_sender_domain_totalscore=498
 rbsscore=0.629899992726084 spamscore=0
 recipient_to_sender_domain_totalscore=0 urlsuspectscore=0.9 adultscore=0
 classifier=spam adjust=0 reason=mlx scancount=1 engine=7.0.1-1402240000
 definitions=main-1408180228
Cc: arch@freebsd.org, Gleb Smirnoff <glebius@freebsd.org>,
 "Alexander V. Chernikov" <melifaro@freebsd.org>,
 "Andrey V. Elsukov" <ae@freebsd.org>
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.18-1
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch/>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 18 Aug 2014 20:39:23 -0000

On 08/18/2014 15:13, Peter Grehan wrote:
>> Newer Intel CPUs have more entries, and AMD CPUs have long (since
>> Barcelona) had more.  In particular, they allow 2 MB page mappings to be
>> cached in a larger L2 TLB.  Nowadays, the trouble is with the 1 GB
>> pages.
>> A lot of CPUs still only support an 8 entry, 1 level TLB for 1 GB pages.
>
>  There are new(ish) ones effectively without 1GB pages. From the
> "Software Optimization Guide for AMD Family 16h Processors"
>


My recollection is that the first Intel processors to support 1 GB page
mappings did this.  They allowed you to set PG_PS on the 1 GB PTE, but there
were no actual 1 GB page TLB entries.

Also, after I modified the direct map on amd64 to use 1 GB pages, I
noticed some strange performance anomalies.  Specifically, sometimes
performance was worse than I expected.  It turned out that when the end
of DRAM wasn't aligned to a 1 GB boundary, and the end of DRAM was
mapped with a 1 GB PTE, the TLB would wind up with 4 KB mappings for
anything covered by that last PTE.  Whereas, before, it was at least 2
MB aligned and we would wind up with 2 MB page mappings in the TLB.  So,
now, the direct map creation code is aware of this issue.
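
The idea can be sketched as follows: map only the 1 GB-aligned prefix of
DRAM with 1 GB pages, and cover the unaligned tail with 2 MB pages. This
is an illustration of the approach under my own simplifying assumptions
(DRAM starts at 0, the tail is rounded up to whole 2 MB pages). It is
not the actual pmap code, and direct_map_plan() is a made-up name.

```python
MIB = 1 << 20
GIB = 1 << 30

def direct_map_plan(dram_end: int):
    """Split [0, dram_end) into a 1 GB-mapped prefix and a 2 MB-mapped tail.

    Returns (n_1g, n_2m): the number of 1 GB mappings covering the
    1 GB-aligned prefix, and the number of 2 MB mappings covering the rest.
    """
    n_1g = dram_end // GIB            # whole 1 GB pages below dram_end
    tail = dram_end - n_1g * GIB      # leftover that is not 1 GB aligned
    n_2m = -(-tail // (2 * MIB))      # round the tail up to whole 2 MB pages
    return n_1g, n_2m

# Hypothetical machine whose DRAM ends at 4 GB + 512 MB.
print(direct_map_plan(4 * GIB + 512 * MIB))
```

With that split, the last partial gigabyte still gets 2 MB TLB entries
instead of falling back to 4 KB ones.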


> "Smashing"
>   ...
> "when the Family 16h processor encounters a 1-Gbyte page size, it will
> smash translations of that 1-Gbyte region into 2-Mbyte TLB entries, each
> of which translates a 2-Mbyte region of the 1-Gbyte page."
>
> later,
>
> Peter.
>


From owner-freebsd-arch@FreeBSD.ORG  Mon Aug 18 20:44:43 2014
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTPS id A0679E7D;
 Mon, 18 Aug 2014 20:44:43 +0000 (UTC)
Received: from mail-wg0-x230.google.com (mail-wg0-x230.google.com
 [IPv6:2a00:1450:400c:c00::230])
 (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits))
 (Client CN "smtp.gmail.com",
 Issuer "Google Internet Authority G2" (verified OK))
 by mx1.freebsd.org (Postfix) with ESMTPS id E2C783560;
 Mon, 18 Aug 2014 20:44:42 +0000 (UTC)
Received: by mail-wg0-f48.google.com with SMTP id x13so5488000wgg.31
 for <multiple recipients>; Mon, 18 Aug 2014 13:44:41 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
 h=mime-version:in-reply-to:references:date:message-id:subject:from:to
 :cc:content-type;
 bh=RKbRj9D+jRvOkG2T5WFd2NHXTFG96lDpLM/CBZchTD8=;
 b=Mkw/6+2y3eTJPzcxuoLX4fGff85Kmv9UHSoCsEVwePZu/K7nbBskS7Fj3f588pEE2p
 x62ZkmAFpawgDjv/a9oi7Llm1aI7GcmHyovzSP/Sv4LqqGMaqpkxdlP0+ifwUGB5/IMk
 zjEqFCnWwexe8Foo/aZEly+QViurNI7/o2fZujePly5RsUzV4N1+vmPMmdhMe6+2saO0
 cwEv+DV+0KzAa8gq0gGcE6TY86mzv9RAoqvR8KxPQYfSfxqSE328ydx3c2kGpVlSEV/I
 EuUFX53q4HwtuYHyUgG3dwchJZEXU3bS/w1Q4m3QKey04bEvkQW4WgfLALa8l1H3F8gQ
 0ytw==
MIME-Version: 1.0
X-Received: by 10.180.102.130 with SMTP id fo2mr1450859wib.29.1408394680972;
 Mon, 18 Aug 2014 13:44:40 -0700 (PDT)
Received: by 10.216.160.9 with HTTP; Mon, 18 Aug 2014 13:44:40 -0700 (PDT)
In-Reply-To: <20140711232914.GH41807@pwnie.vrt.sourcefire.com>
References: <20140711232914.GH41807@pwnie.vrt.sourcefire.com>
Date: Mon, 18 Aug 2014 16:44:40 -0400
Message-ID: <CADt0fhx_KMhvRaWuht1GO6WZVW4euUX8WG6eGPD4QnFf8fDM2g@mail.gmail.com>
Subject: Re: [RFC] ASLR Whitepaper and Candidate Final Patch
From: Shawn Webb <lattera@gmail.com>
To: freebsd-arch@freebsd.org
Content-Type: text/plain; charset=UTF-8
X-Content-Filtered-By: Mailman/MimeDel 2.1.18-1
Cc: PaX Team <pageexec@freemail.hu>, Bryan Drewery <bdrewery@freebsd.org>,
 Alan Cox <alc@rice.edu>,
 =?UTF-8?Q?Dag=2DErling_Sm=C3=B8rgrav?= <des@freebsd.org>,
 Oliver Pinter <oliver.pntr@gmail.com>
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.18-1
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch/>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 18 Aug 2014 20:44:43 -0000

I've uploaded a new patch to Phabricator: https://reviews.freebsd.org/D473. I'm
interested in hearing feedback from the community.


On Fri, Jul 11, 2014 at 7:29 PM, Shawn Webb <lattera@gmail.com> wrote:

> Hey All,
>
> Oliver Pinter and I have been working hard on our ASLR implementation.
> We're now in the final stages of development and would like to get
> feedback from the community. I've attached to this email a small
> whitepaper that details our implementation and the accompanying patch.
>
> There is one part of the patch that I wrote that is quite an ugly hack
> and would like to get some feedback on. I added a little hack to
> sys_mmap() to apply ASLR to calls to mmap(2) when MAP_32BIT is
> specified. I'd like to replace that ugly hack with something a bit more
> beautiful, so if anyone has any suggestions, I'm all ears.
>
> Other than that ugly hack, the code adheres to FreeBSD's style(9)
> standards. I believe we have an awesome implementation, one I've
> personally been using without issue for months.
>
> I'm looking forward to your comments and questions. I've CC'd the PaX
> team. Please keep them CC'd in your replies.
>
> Thank you very much,
>
> Shawn Webb
> CC: PaX Team
> CC: Oliver Pinter
> CC: des@freebsd.org
> CC: alc@rice.edu
> CC: bdrewery@freebsd.org
>
> PS - Sorry for the duplicate emails. I hit the wrong key and didn't CC
> everyone.
>

From owner-freebsd-arch@FreeBSD.ORG  Mon Aug 18 22:35:53 2014
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTPS id C64FAFD1;
 Mon, 18 Aug 2014 22:35:53 +0000 (UTC)
Received: from mail-ie0-x22e.google.com (mail-ie0-x22e.google.com
 [IPv6:2607:f8b0:4001:c03::22e])
 (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits))
 (Client CN "smtp.gmail.com",
 Issuer "Google Internet Authority G2" (verified OK))
 by mx1.freebsd.org (Postfix) with ESMTPS id 88F883F43;
 Mon, 18 Aug 2014 22:35:53 +0000 (UTC)
Received: by mail-ie0-f174.google.com with SMTP id rp18so176754iec.5
 for <multiple recipients>; Mon, 18 Aug 2014 15:35:52 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
 h=mime-version:reply-to:in-reply-to:references:date:message-id
 :subject:from:to:cc:content-type;
 bh=f5LlijFXX7pDpvpWeF6NZZMfcBpCCrtSa9MFRU69x3Y=;
 b=DFW1mSqsSRI1uFdfrYZDbmQBi+Mx9mf0220Vhq7yvK7Z85n2g6+WDSslneqcjwq35x
 nk+/DTSIO+RD3w38eu3uSZhsuyRHGYLQpl8unnB6+YSPzfWuiJPETJsGQk7F3DqryLP7
 JwAm9jDQXaKUl7RmPVEXSuYKAFV+Yzr+Q2jc+PDsw/vAu6RbJmr7eEKuxFHU7nPZkz/4
 2bT1S3rLrRg0ROEiYqo+5ot0zzgAj3WdbePnldg+6Wrdm3J8i20XCDoEGYmWFCcwaqoc
 W57jSrAVxyOPG55MvQq+SSCZoVGUhPMl+0BEEclylHD2rJTq+tbl37/j38ISM3hkobZl
 oOhQ==
MIME-Version: 1.0
X-Received: by 10.43.127.136 with SMTP id ha8mr3526994icc.78.1408401352808;
 Mon, 18 Aug 2014 15:35:52 -0700 (PDT)
Received: by 10.43.17.196 with HTTP; Mon, 18 Aug 2014 15:35:52 -0700 (PDT)
Reply-To: alc@freebsd.org
In-Reply-To: <257A0976-7C5E-4029-AF32-BFB3A6C60832@bsdimp.com>
References: <53F215A9.8010708@FreeBSD.org> <20140818183925.GP2737@kib.kiev.ua>
 <CAJUyCcM7ZipmYu8OLxT2TCPjS+CSTGPRnotdKgchoNQH8s8ndA@mail.gmail.com>
 <53F25E60.5050109@freebsd.org>
 <257A0976-7C5E-4029-AF32-BFB3A6C60832@bsdimp.com>
Date: Mon, 18 Aug 2014 17:35:52 -0500
Message-ID: <CAJUyCcM_4-jiJ5PqnmT6H-2qg63nEXmpZ69vGGb6SR0Trp8e0Q@mail.gmail.com>
Subject: Re: superpages for UMA
From: Alan Cox <alan.l.cox@gmail.com>
To: Warner Losh <imp@bsdimp.com>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
X-Content-Filtered-By: Mailman/MimeDel 2.1.18-1
Cc: "freebsd-arch@freebsd.org" <arch@freebsd.org>,
 Peter Grehan <grehan@freebsd.org>
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.18-1
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch/>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 18 Aug 2014 22:35:53 -0000

On Mon, Aug 18, 2014 at 3:26 PM, Warner Losh <imp@bsdimp.com> wrote:

>
> On Aug 18, 2014, at 2:13 PM, Peter Grehan <grehan@freebsd.org> wrote:
>
> >> Newer Intel CPUs have more entries, and AMD CPUs have long (since
> >> Barcelona) had more.  In particular, they allow 2 MB page mappings to be
> >> cached in a larger L2 TLB.  Nowadays, the trouble is with the 1 GB pages.
> >> A lot of CPUs still only support an 8 entry, 1 level TLB for 1 GB pages.
> >
> > There are new(ish) ones effectively without 1GB pages. From the
> > "Software Optimization Guide for AMD Family 16h Processors"
> >
> > "Smashing"
> >  ...
> > "when the Family 16h processor encounters a 1-Gbyte page size, it will
> > smash translations of that 1-Gbyte region into 2-Mbyte TLB entries, each
> > of which translates a 2-Mbyte region of the 1-Gbyte page."
>
> “we’ll emulate this feature designed to make things go faster in hardware
> in software by doing the very thing that makes it go slow in hardware.”
>
> Fun times. Performance Smashing!
>
>

I'm guessing that these are low-power processors, where they don't want to
have another CAM consuming power.  Under those circumstances, it's still
better to support 1 GB page mappings in the page table even if the TLB
doesn't support them than not to support 1 GB page mappings at all.  With
the hierarchical page tables on x86, you get a 512x reduction in page table
size with each increase in page size.  So, on a TLB miss, the page table
walk is more likely to be all L2 data cache hits, rather than misses that
go all the way to DRAM.
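
The 512x figure follows from the x86-64 page table layout, where each level holds 512 eight-byte entries. A minimal sketch of the arithmetic, counting only the leaf entries needed to map a region (the macro and function names here are illustrative, not taken from any FreeBSD header):

```c
#include <stdint.h>

#define	PTE_SIZE	8ULL		/* bytes per x86-64 page table entry */
#define	PAGE_4K		(1ULL << 12)
#define	PAGE_2M		(1ULL << 21)
#define	PAGE_1G		(1ULL << 30)

/* Leaf entries needed to map `region` bytes using pages of `pagesize` bytes. */
static uint64_t
leaf_entries(uint64_t region, uint64_t pagesize)
{

	return ((region + pagesize - 1) / pagesize);
}

/* Bytes occupied by the leaf entries for the same mapping. */
static uint64_t
leaf_bytes(uint64_t region, uint64_t pagesize)
{

	return (leaf_entries(region, pagesize) * PTE_SIZE);
}
```

For a 1 GB region this gives 262144 entries (2 MB of leaf tables) with 4 KB pages, 512 entries (4 KB) with 2 MB pages, and a single entry with a 1 GB page, so the data a page table walk has to touch shrinks by 512x at each step.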

One feature that I always liked about the AMD performance counters was that
they allowed you to count L2 cache misses caused by page table walks on a
TLB miss.  This was often a better predictor of whether large pages were
going to be beneficial than counting TLB misses.

From owner-freebsd-arch@FreeBSD.ORG  Tue Aug 19 19:24:17 2014
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTPS id 62593AD6
 for <freebsd-arch@freebsd.org>; Tue, 19 Aug 2014 19:24:17 +0000 (UTC)
Received: from mail-ie0-x22b.google.com (mail-ie0-x22b.google.com
 [IPv6:2607:f8b0:4001:c03::22b])
 (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits))
 (Client CN "smtp.gmail.com",
 Issuer "Google Internet Authority G2" (verified OK))
 by mx1.freebsd.org (Postfix) with ESMTPS id 2B0F9366D
 for <freebsd-arch@freebsd.org>; Tue, 19 Aug 2014 19:24:17 +0000 (UTC)
Received: by mail-ie0-f171.google.com with SMTP id at1so1726811iec.30
 for <freebsd-arch@freebsd.org>; Tue, 19 Aug 2014 12:24:16 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
 h=mime-version:reply-to:in-reply-to:references:date:message-id
 :subject:from:to:cc:content-type;
 bh=AhsNDcdclo1l2TRZsvQfLnLnR5DxqhmfcXkd/jJWcPc=;
 b=BNJrdKhVfXWJuAXB7PLbYJQ1f2XHe/S1cUUOsRgv2j/hQGSjHyDMOnWA931HCpuhO/
 7Nrebe6UCFNRsrOBGkSOvqZ0JVpmP8ESUBDvGxjFTka+s4V49HDvvnhDY+RE7kNsqSrV
 OumDKpSiltmk5hHj5HAeKvIrkq1kVMmHNWAaiKapROlL/zyxeG1IqnlnxlO+i+h6XeZO
 YhuS0+wEmV+4ePOsTZK8O57yLWtkU4zac8MMosYM8ZKH4qPJkULmk3zyiz3Vv0GqBiqV
 7WB5L29wvH44V4UNYkfCEV7BBD5Vhu0jBPd8j63bhuXMGY2NhIQDHWUuy++fvfsFy7FM
 rO9g==
MIME-Version: 1.0
X-Received: by 10.43.164.130 with SMTP id ms2mr44412552icc.9.1408476256600;
 Tue, 19 Aug 2014 12:24:16 -0700 (PDT)
Received: by 10.43.17.196 with HTTP; Tue, 19 Aug 2014 12:24:16 -0700 (PDT)
Reply-To: alc@freebsd.org
In-Reply-To: <20140817012646.GA21025@dft-labs.eu>
References: <1408064112-573-1-git-send-email-mjguzik@gmail.com>
 <1408064112-573-2-git-send-email-mjguzik@gmail.com>
 <20140816093811.GX2737@kib.kiev.ua>
 <20140816185406.GD2737@kib.kiev.ua>
 <20140817012646.GA21025@dft-labs.eu>
Date: Tue, 19 Aug 2014 14:24:16 -0500
Message-ID: <CAJUyCcPA7ZDNbwyfx3fT7mq3SE7M-mL5he=eXZ8bY3z-xUCJ-g@mail.gmail.com>
Subject: Re: [PATCH 1/2] Implement simple sequence counters with memory
 barriers.
From: Alan Cox <alan.l.cox@gmail.com>
To: Mateusz Guzik <mjguzik@gmail.com>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
X-Content-Filtered-By: Mailman/MimeDel 2.1.18-1
Cc: Konstantin Belousov <kostikbel@gmail.com>, Johan Schuijt <johan@transip.nl>,
 "freebsd-arch@freebsd.org" <freebsd-arch@freebsd.org>
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.18-1
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch/>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 19 Aug 2014 19:24:17 -0000

On Sat, Aug 16, 2014 at 8:26 PM, Mateusz Guzik <mjguzik@gmail.com> wrote:

> On Sat, Aug 16, 2014 at 09:54:06PM +0300, Konstantin Belousov wrote:
> > On Sat, Aug 16, 2014 at 12:38:11PM +0300, Konstantin Belousov wrote:
> > > On Fri, Aug 15, 2014 at 02:55:11AM +0200, Mateusz Guzik wrote:
> > > > ---
> > > >  sys/sys/seq.h | 126 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > > >  1 file changed, 126 insertions(+)
> > > >  create mode 100644 sys/sys/seq.h
> > > >
> > > > diff --git a/sys/sys/seq.h b/sys/sys/seq.h
> > > > new file mode 100644
> > > > index 0000000..0971aef
> > > > --- /dev/null
> > > > +++ b/sys/sys/seq.h
> [..]
> > > > +#ifndef _SYS_SEQ_H_
> > > > +#define _SYS_SEQ_H_
> > > > +
> > > > +#ifdef _KERNEL
> > > > +
> > > > +/*
> > > > + * Typical usage:
> > > > + *
> > > > + * writers:
> > > > + *       lock_exclusive(&obj->lock);
> > > > + *       seq_write_begin(&obj->seq);
> > > > + *       .....
> > > > + *       seq_write_end(&obj->seq);
> > > > + *       unlock_exclusive(&obj->unlock);
> > > > + *
> > > > + * readers:
> > > > + *       obj_t lobj;
> > > > + *       seq_t seq;
> > > > + *
> > > > + *       for (;;) {
> > > > + *               seq = seq_read(&gobj->seq);
> > > > + *               lobj = gobj;
> > > > + *               if (seq_consistent(&gobj->seq, seq))
> > > > + *                       break;
> > > > + *               cpu_spinwait();
> > > > + *       }
> > > > + *       foo(lobj);
> > > > + */
> > > > +
> > > > +typedef uint32_t seq_t;
> > > > +
> > > > +/* A hack to get MPASS macro */
> > > > +#include <sys/systm.h>
> > > > +#include <sys/lock.h>
> > > > +
> > > > +#include <machine/cpu.h>
> > > > +
> > > > +static __inline bool
> > > > +seq_in_modify(seq_t seqp)
> > > > +{
> > > > +
> > > > + return (seqp & 1);
> > > > +}
> > > > +
> > > > +static __inline void
> > > > +seq_write_begin(seq_t *seqp)
> > > > +{
> > > > +
> > > > + MPASS(!seq_in_modify(*seqp));
> > > > + (*seqp)++;
> > > > + wmb();
> > > This probably ought to be written as atomic_add_rel_int(seqp, 1);
> > Alan Cox rightfully pointed out that better expression is
> > v = *seqp + 1;
> > atomic_store_rel_int(seqp, v);
> > which also takes care of TSO on x86.
> >
>
> Well, my memory-barrier-and-so-on-fu is rather weak.
>
> I had another look at the issue. At least on amd64, it looks like only
> compiler barrier is required for both reads and writes.
>
> The AMD64 Architecture Programmer’s Manual Volume 2: System
> Programming, 7.2 Multiprocessor Memory Access Ordering states:
>
> "Loads do not pass previous loads (loads are not reordered). Stores do
> not pass previous stores (stores are not reordered)"
>
> Since the code modifying stuff only performs a series of writes and we
> expect exclusive writers, I find it applicable to this scenario.
>
> I checked linux sources and generated assembly, they indeed issue only
> a compiler barrier on amd64 (and for intel processors as well).
>
> atomic_store_rel_int on amd64 seems fine in this regard, but the only
> function for loads issues lock cmpxchg which kills performance
> (median 55693659 -> 12789232 ops in a microbenchmark) for no gain.
>
> Additionally, release and acquire semantics seem to be a stronger
> guarantee than needed.
>
>

This statement left me puzzled and got me to look at our x86 atomic.h for
the first time in years.  It appears that our implementation of
atomic_load_acq_int() on x86 is, umm ..., unconventional.  That is, it is
enforcing a constraint that simple acquire loads don't normally enforce.
For example, the C11 stdatomic.h simple acquire load doesn't enforce this
constraint.  Moreover, our own implementation of atomic_load_acq_int() on
ia64, where the mapping from atomic_load_acq_int() to machine instructions
is straightforward, doesn't enforce this constraint either.

Give us a chance to sort this out before you do anything further.  As
Kostik said, but in different words, we've always written our
machine-independent layer code using acquires and releases to express the
required ordering constraints and not {r,w}mb() primitives.
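
For illustration, the sequence counter from the quoted patch can be expressed with acquires and releases only. The following is a sketch in portable C11 <stdatomic.h>, not FreeBSD's atomic(9) interface and not the code that was proposed or committed; struct obj, obj_write() and obj_read() are hypothetical, and the payload fields are relaxed atomics so that a racing reader has defined behavior:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

typedef _Atomic uint32_t seq_t;

/* Hypothetical protected object: a and b must be observed as a pair. */
struct obj {
	seq_t		 seq;
	_Atomic uint32_t a;
	_Atomic uint32_t b;
};

static inline bool
seq_in_modify(uint32_t seq)
{

	return ((seq & 1) != 0);
}

/* Writer side; the caller is assumed to hold an exclusive lock. */
static inline void
seq_write_begin(seq_t *seqp)
{
	uint32_t seq = atomic_load_explicit(seqp, memory_order_relaxed);

	assert(!seq_in_modify(seq));
	atomic_store_explicit(seqp, seq + 1, memory_order_relaxed);
	/* Order the odd counter value before the data stores that follow. */
	atomic_thread_fence(memory_order_release);
}

static inline void
seq_write_end(seq_t *seqp)
{
	uint32_t seq = atomic_load_explicit(seqp, memory_order_relaxed);

	assert(seq_in_modify(seq));
	/* Release store: earlier data stores are visible before the even value. */
	atomic_store_explicit(seqp, seq + 1, memory_order_release);
}

/* Reader side: wait until no write is in progress, then return the counter. */
static inline uint32_t
seq_read(seq_t *seqp)
{
	uint32_t seq;

	do {
		seq = atomic_load_explicit(seqp, memory_order_acquire);
	} while (seq_in_modify(seq));
	return (seq);
}

static inline bool
seq_consistent(seq_t *seqp, uint32_t oldseq)
{

	/* Order the preceding data loads before the counter is re-read. */
	atomic_thread_fence(memory_order_acquire);
	return (atomic_load_explicit(seqp, memory_order_relaxed) == oldseq);
}

static void
obj_write(struct obj *o, uint32_t a, uint32_t b)
{

	seq_write_begin(&o->seq);
	atomic_store_explicit(&o->a, a, memory_order_relaxed);
	atomic_store_explicit(&o->b, b, memory_order_relaxed);
	seq_write_end(&o->seq);
}

static void
obj_read(struct obj *o, uint32_t *a, uint32_t *b)
{
	uint32_t seq;

	for (;;) {
		seq = seq_read(&o->seq);
		*a = atomic_load_explicit(&o->a, memory_order_relaxed);
		*b = atomic_load_explicit(&o->b, memory_order_relaxed);
		if (seq_consistent(&o->seq, seq))
			break;
	}
}
```

On a strongly ordered machine such as amd64, a compiler would normally be expected to lower these acquire loads, release stores and acquire/release fences to plain moves plus compiler-level ordering, which is the cost Mateusz is asking for. That is an expectation about typical code generation, not a measurement.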



> As far as sequence counters go, we should be able to get away with
> making the following:
> - all relevant reads are performed between given points
> - all relevant writes are performed between given points
>
> As such, I propose introducing additional atomic_* function variants
> (or stealing smp_{w,r,}mb idea from linux) which provide just that.
>
> So for amd64 reading guarantee and writing guarantee could be provided
> in the same way with a compiler barrier.
>
> > > Same note for all other linux-style barriers.  In fact, on x86
> > > wmb() is sfence and it serves no useful purpose in seq_write*.
> > >
> > > Overall, it feels too alien and linux-ish for my taste.
> > > Since we have sequence bound to some lock anyway, could we introduce
> > > some sort of generation-aware locks variants, which extend existing
> > > locks, and where lock/unlock bump generation number ?
> > Still, merging it to the guts of lock implementation is right
> > approach, IMO.
> >
>
> Current usage would be along with filedesc (sx) lock. The lock protects
> writes to entire fd table (and lock holders can block in malloc), while
> each file descriptor has its own counter. Also areas covered by seq are
> short and cannot block.
>
> As such, I don't really see any way to merge the lock with the counter.
>
> I agree it would be useful, provided area protected by the lock would be
> the same as the one protected by the counter. If this code hits the tree
> and one day turns out someone needs such functionality, there should not
> be any problems (apart from time effort) in implementing this.
>
> --
> Mateusz Guzik <mjguzik gmail.com>
> _______________________________________________
> freebsd-arch@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-arch
> To unsubscribe, send any mail to "freebsd-arch-unsubscribe@freebsd.org"
>

From owner-freebsd-arch@FreeBSD.ORG  Wed Aug 20 14:14:17 2014
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTPS id 99F90CF2
 for <arch@freebsd.org>; Wed, 20 Aug 2014 14:14:17 +0000 (UTC)
Received: from mail-la0-x22e.google.com (mail-la0-x22e.google.com
 [IPv6:2a00:1450:4010:c03::22e])
 (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits))
 (Client CN "smtp.gmail.com",
 Issuer "Google Internet Authority G2" (verified OK))
 by mx1.freebsd.org (Postfix) with ESMTPS id 23A3A33AA
 for <arch@freebsd.org>; Wed, 20 Aug 2014 14:14:16 +0000 (UTC)
Received: by mail-la0-f46.google.com with SMTP id b8so7430222lan.33
 for <arch@freebsd.org>; Wed, 20 Aug 2014 07:14:15 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
 h=sender:date:from:to:subject:message-id:mail-followup-to
 :mime-version:content-type:content-disposition:user-agent;
 bh=679Ranpv3mj3AMEB7vEKIdLsDV9vUA3k9ZcEWr9DivI=;
 b=tQbaED1OOuLfgdnYpHquW9W59Ok40jw7DlwhK4c/W+QsBsjk58e+4yu/diO16f7dlt
 WAK5yypywKplmyU/xTHi9atZ5skpAF/OTcRBLYJQaALWovzTlAPFhUFBf0TggWn3jT1Q
 VTlee9WUiOpJfR3Oii7VoUtKnbMcp9MozjYqZffbsC1q2CRBk3XmD85xmH5xiCD4WQk1
 IUwa2E+6lKp3g556W5LgjSKDyaUgxBuApxevDCVhsnL3HtjqkYmoZqCKKP4aWuwvBt5a
 7toMCMI+tyJ2M2tnis2yEjhdvulLZ2zNhLNsq214vKpJV5iTva9ktknceLdvjMAbG8A9
 620A==
X-Received: by 10.152.36.195 with SMTP id s3mr42458925laj.28.1408544055014;
 Wed, 20 Aug 2014 07:14:15 -0700 (PDT)
Received: from pc5.home (adbj194.neoplus.adsl.tpnet.pl. [79.184.9.194])
 by mx.google.com with ESMTPSA id a1sm14515456lak.45.2014.08.20.07.14.13
 for <arch@freebsd.org>
 (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128);
 Wed, 20 Aug 2014 07:14:14 -0700 (PDT)
Sender: =?UTF-8?Q?Edward_Tomasz_Napiera=C5=82a?= <etnapierala@gmail.com>
Date: Wed, 20 Aug 2014 16:14:11 +0200
From: Edward Tomasz =?utf-8?Q?Napiera=C5=82a?= <trasz@FreeBSD.org>
To: arch@FreeBSD.org
Subject: Autofs startup scripts.
Message-ID: <20140820141411.GB12179@pc5.home>
Mail-Followup-To: arch@FreeBSD.org
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.5.21 (2010-09-15)
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.18-1
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch/>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 20 Aug 2014 14:14:17 -0000

As it is now, autofs uses three separate rc.d scripts: automount,
automountd, and autounmountd.  They execute one utility and two daemons.
They are all controlled by a single rc var: autofs_enable.  Question
is: is this the right way to do it?  Would it be better to have only
one script instead?  If I went this route, how should configuring
command line options for each of the three executables work?


From owner-freebsd-arch@FreeBSD.ORG  Wed Aug 20 16:00:40 2014
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTPS id 2693371A;
 Wed, 20 Aug 2014 16:00:40 +0000 (UTC)
Received: from bigwig.baldwin.cx (bigwig.baldwin.cx [IPv6:2001:470:1f11:75::1])
 (using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits))
 (Client did not present a certificate)
 by mx1.freebsd.org (Postfix) with ESMTPS id F0661307E;
 Wed, 20 Aug 2014 16:00:39 +0000 (UTC)
Received: from jhbbsd.localnet (unknown [209.249.190.124])
 by bigwig.baldwin.cx (Postfix) with ESMTPSA id D78A1B9C4;
 Wed, 20 Aug 2014 12:00:38 -0400 (EDT)
From: John Baldwin <jhb@freebsd.org>
To: Benjamin Kaduk <bjk@freebsd.org>
Subject: Re: current fd allocation idiom
Date: Wed, 20 Aug 2014 11:10:10 -0400
User-Agent: KMail/1.13.5 (FreeBSD/8.4-CBSD-20140415; KDE/4.5.5; amd64; ; )
References: <20140717235538.GA15714@dft-labs.eu>
 <20140813015627.GC17869@dft-labs.eu>
 <alpine.GSO.1.10.1408151916150.21571@multics.mit.edu>
In-Reply-To: <alpine.GSO.1.10.1408151916150.21571@multics.mit.edu>
MIME-Version: 1.0
Content-Type: Text/Plain;
  charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
Message-Id: <201408201110.10431.jhb@freebsd.org>
X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.2.7
 (bigwig.baldwin.cx); Wed, 20 Aug 2014 12:00:38 -0400 (EDT)
Cc: Konstantin Belousov <kostikbel@gmail.com>,
 Mateusz Guzik <mjguzik@gmail.com>, freebsd-arch@freebsd.org
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.18-1
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch/>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 20 Aug 2014 16:00:40 -0000

On Friday, August 15, 2014 7:20:03 pm Benjamin Kaduk wrote:
> On Tue, 12 Aug 2014, Mateusz Guzik wrote:
> 
> > On Tue, Aug 12, 2014 at 09:31:15PM -0400, Benjamin Kaduk wrote:
> > > On Tue, Aug 12, 2014 at 7:36 PM, Mateusz Guzik <mjguzik@gmail.com> wrote:
> > >
> > > > I would expect soabort to result in a timeout/reset as opposed to regular
> > > > connection close.
> > > >
> > > > Comments around soabort suggest it should not be used as a replacement
> > > > for close, but maybe this is largely because of what the other end will
> > > > see. That will need to be investigated.
> > > >
> > > >
> > > I added some text regarding soabort to socket.9 in r266962 -- does that
> > > help clarify the situation?
> > >
> >
> > Nope. :-)
> >
> > It is unclear if the only motivation here is making sure nobody else
> > sees the socket when given thread calls soabort. This would be easily
> > guaranteed in here: fd allocation failed, fp with given socket was never
> > exposed to anyone.
> >
> > So, if you say soabort would work here just fine, I'm happy to use it
> > and blame you for problems. :-)
> 
> Hmm, I was hoping that jhb would chime in and save me from being on the
> hook, but it does look like soabort() would be acceptable in this case.

I think having the EMFILE/ENFILE case use the same exact logic as a listen 
queue overflow (i.e. soabort()) is correct.

-- 
John Baldwin

From owner-freebsd-arch@FreeBSD.ORG  Wed Aug 20 16:00:41 2014
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTPS id 6B09A71C;
 Wed, 20 Aug 2014 16:00:41 +0000 (UTC)
Received: from bigwig.baldwin.cx (bigwig.baldwin.cx [IPv6:2001:470:1f11:75::1])
 (using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits))
 (Client did not present a certificate)
 by mx1.freebsd.org (Postfix) with ESMTPS id 40B66307F;
 Wed, 20 Aug 2014 16:00:41 +0000 (UTC)
Received: from jhbbsd.localnet (unknown [209.249.190.124])
 by bigwig.baldwin.cx (Postfix) with ESMTPSA id 28E96B9CA;
 Wed, 20 Aug 2014 12:00:40 -0400 (EDT)
From: John Baldwin <jhb@freebsd.org>
To: Bruce Evans <brde@optusnet.com.au>
Subject: Re: [PATCH 0/2] plug capability races
Date: Wed, 20 Aug 2014 11:11:47 -0400
User-Agent: KMail/1.13.5 (FreeBSD/8.4-CBSD-20140415; KDE/4.5.5; amd64; ; )
References: <1408064112-573-1-git-send-email-mjguzik@gmail.com>
 <201408151031.45967.jhb@freebsd.org> <20140816102840.V1007@besplex.bde.org>
In-Reply-To: <20140816102840.V1007@besplex.bde.org>
MIME-Version: 1.0
Content-Type: Text/Plain;
  charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
Message-Id: <201408201111.47601.jhb@freebsd.org>
X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.2.7
 (bigwig.baldwin.cx); Wed, 20 Aug 2014 12:00:40 -0400 (EDT)
Cc: Robert Watson <rwatson@freebsd.org>, Mateusz Guzik <mjguzik@gmail.com>,
 Konstantin Belousov <kib@freebsd.org>, Johan Schuijt <johan@transip.nl>,
 freebsd-arch@freebsd.org
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.18-1
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch/>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 20 Aug 2014 16:00:41 -0000

On Friday, August 15, 2014 9:34:59 pm Bruce Evans wrote:
> On Fri, 15 Aug 2014, John Baldwin wrote:
> 
> > One thing I would like to see is for the timecounter code to be adapted to use
> > the seq API instead of doing it by hand (the timecounter code is also missing
> > barriers due to doing it by hand).
> 
> Locking in the timecounter code is poor (1), but I fear a general mechanism
> would be slower.  Also, the timecounter code now extends into userland,
> so purely kernel locking cannot work for it.  The userland part is
> more careful about locking than the kernel.  It has memory barriers and
> other pessimizations which were intentionally left out of the kernel
> locking for timecounters.  If these barriers are actually necessary, then
> they give the silly situation that there are less races for userland
> timecounting than kernel timecounting provided userland mostly does
> direct accesses instead of syscalls and kernel uses of timecounters are
> infrequent enough to not race often with the userland accesses.

Yes, the userland code is more correct here.  The barriers are indeed missing in
the kernel part, and adding them should give something equivalent to a correctly
working seq API as it is doing the same thing.

-- 
John Baldwin

From owner-freebsd-arch@FreeBSD.ORG  Wed Aug 20 16:15:57 2014
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTPS id BCA6DDC7
 for <arch@freebsd.org>; Wed, 20 Aug 2014 16:15:57 +0000 (UTC)
Received: from mail-la0-x22c.google.com (mail-la0-x22c.google.com
 [IPv6:2a00:1450:4010:c03::22c])
 (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits))
 (Client CN "smtp.gmail.com",
 Issuer "Google Internet Authority G2" (verified OK))
 by mx1.freebsd.org (Postfix) with ESMTPS id 476983349
 for <arch@freebsd.org>; Wed, 20 Aug 2014 16:15:57 +0000 (UTC)
Received: by mail-la0-f44.google.com with SMTP id el20so7641445lab.3
 for <arch@freebsd.org>; Wed, 20 Aug 2014 09:15:55 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
 h=mime-version:sender:in-reply-to:references:date:message-id:subject
 :from:to:content-type:content-transfer-encoding;
 bh=9tz6pRzSioXhyPeMN2Pix2qrWd+bOt74pN40GoGgel0=;
 b=wFmyyo8T9YVEYxQwSVeyYkIykLZKqMu/U3GwuDWNyqQ3hh6pzgY3zFUP7q9S7nnxM2
 U7QTzDV+OeusnjxQVTjuxTXn9CAbf96IMXsqda196oLDhQkFhJWF2b8hMyp2XMgin21z
 0H1Mha6PKF0sW1jZ6klYCPYILyk2hZbrTrMlW5SN+DGHaBRTYpQhIhZKSwlHrd2kH45y
 hofqdNSnygQbl6GNMia9mUxoWo1DNSejTFjPKRpuI3aEyu1FOvLyPb3NytVDBA3CBB0v
 3GhXJhkKr58lzvfyJZxPmPUEEoq1UcABcGxq8lsYe5tH3iCdphgULezJHp9GufVSQElZ
 7C/A==
MIME-Version: 1.0
X-Received: by 10.152.22.165 with SMTP id e5mr22478131laf.57.1408551355016;
 Wed, 20 Aug 2014 09:15:55 -0700 (PDT)
Sender: crodr001@gmail.com
Received: by 10.112.197.107 with HTTP; Wed, 20 Aug 2014 09:15:54 -0700 (PDT)
In-Reply-To: <20140820141411.GB12179@pc5.home>
References: <20140820141411.GB12179@pc5.home>
Date: Wed, 20 Aug 2014 09:15:54 -0700
X-Google-Sender-Auth: juSB_INJEsQz6N3GjeO_CN0oqzE
Message-ID: <CAG=rPVeqZiWWXk4saCgk+-+azRpiZO54RwV4JOHJJscJHfT2fw@mail.gmail.com>
Subject: Re: Autofs startup scripts.
From: Craig Rodrigues <rodrigc@FreeBSD.org>
To: arch@freebsd.org
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.18-1
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch/>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 20 Aug 2014 16:15:57 -0000

On Wed, Aug 20, 2014 at 7:14 AM, Edward Tomasz Napierała
<trasz@freebsd.org> wrote:
> As it is now, autofs uses three separate rc.d scripts: automount,
> automountd, and autounmountd.  They execute one utility and two daemons.
> They are all controlled by a single rc var: autofs_enable.  Question
> is: is this the right way to do it?  Would it be better to have only
> one script instead?  If I went this route, how should configuring
> command line options for each of the three executables work?

You could probably combine everything into one autofs script, since those
three scripts are very closely related.  You could have separate
automount_args, automountd_args, autounmountd_args variables for each
binary.  There is a freebsd-rc@ mailing list where you can ask for
help on this stuff, but it is a low traffic list.

--
Craig

From owner-freebsd-arch@FreeBSD.ORG  Wed Aug 20 19:31:15 2014
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTPS id 162CFC31;
 Wed, 20 Aug 2014 19:31:15 +0000 (UTC)
Received: from mail.dawidek.net (garage.dawidek.net [91.121.88.72])
 by mx1.freebsd.org (Postfix) with ESMTP id D27723CC5;
 Wed, 20 Aug 2014 19:31:13 +0000 (UTC)
Received: from localhost (89-77-9-208.dynamic.chello.pl [89.77.9.208])
 by mail.dawidek.net (Postfix) with ESMTPSA id 9645915A;
 Wed, 20 Aug 2014 21:23:10 +0200 (CEST)
Date: Wed, 20 Aug 2014 21:24:19 +0200
From: Pawel Jakub Dawidek <pjd@FreeBSD.org>
To: Mateusz Guzik <mjguzik@gmail.com>
Subject: Re: [PATCH 0/2] plug capability races
Message-ID: <20140820192419.GA1834@garage.freebsd.pl>
References: <1408064112-573-1-git-send-email-mjguzik@gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <1408064112-573-1-git-send-email-mjguzik@gmail.com>
X-OS: FreeBSD 11.0-CURRENT amd64
User-Agent: Mutt/1.5.22 (2013-10-16)
Cc: Konstantin Belousov <kib@FreeBSD.org>, Robert Watson <rwatson@FreeBSD.org>,
 Johan Schuijt <johan@transip.nl>, freebsd-arch@freebsd.org
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.18-1
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch/>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 20 Aug 2014 19:31:15 -0000

The patch looks good to me. Thanks for working on the fix, Mateusz!

The only minor nit I found is that fde_change, fde_change_size and
fde_seq should use capital letters as those are macros.

On Fri, Aug 15, 2014 at 02:55:10AM +0200, Mateusz Guzik wrote:
> fget_unlocked currently reads 'fde' which is a structure consisting of
> several fields. In effect the read is non-atomic and may result in
> obtaining file pointer with stale or incorrect capabilities.
> 
> Example race is with dup2.
> 
> Side effect is that capability checks can be circumvented.
> 
> Proposed way to fix it is with the help of sequence counters.
> 
> Patchset assumes stuff from
> 'Getting rid of atomic_load_acq_int(&fdp->fd_nfiles)) from fget_unlocked'
> ( http://lists.freebsd.org/pipermail/freebsd-arch/2014-July/015550.html )
> is applied. There is no technical dependency between patches (apart from
> READ_ONCE), but this patch amortizes performance hit introduced with seqlock.
> 
> So this introduces a measurable hit with a microbenchmark (16 threads
> reading from a pipe which fails with EAGAIN), but is still much faster than
> current code with atomic_load_acq_int(&fdp->fd_nfiles).
> 
> x propernoacq-readpipe-run-sum
> + seq2-noacq-readpipe-run-sum
> N           Min           Max        Median           Avg        Stddev
> x  20      59479718      59527286      59496714      59499504     13752.968
> +  20      54520752      54920054      54829539      54773480     136842.96
> Difference at 95.0% confidence
>     	-4.72602e+06 +/- 62244.4
> 	-7.94296% +/- 0.104613%
> 	(Student's t, pooled s = 97250)
> 
> There is still one theoretical race unfixed, but I don't believe it matters
> much.
> 
> The race is:
> fp gets reallocated before the refcount check. This results in returning fp
> regardless of new caps, but I don't see how this particular race could be
> exploited. It could be fixed by re-reading entire fde and checking if it
> changed.
> 
> -- 
> 2.0.2
> 
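
For anyone following along, the sequence counter idea can be sketched in
userland C11.  This is only an illustration under stated assumptions, not
the patch: struct entry, entry_write(), entry_try_read() and entry_read()
are made-up names, the two fields merely stand in for the file pointer and
its capabilities, and writers are assumed to be serialized by some other
lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical table entry: two fields that must be observed together. */
struct entry {
	atomic_uint seq;		/* odd while an update is in progress */
	_Atomic uintptr_t fp;		/* stand-in for the file pointer */
	_Atomic uint64_t caps;		/* stand-in for the capability rights */
};

/* Writer side.  Assumes the caller already excludes other writers. */
static void
entry_write(struct entry *e, uintptr_t fp, uint64_t caps)
{
	unsigned s = atomic_load_explicit(&e->seq, memory_order_relaxed);

	assert((s & 1) == 0);		/* no update may be in flight */
	atomic_store_explicit(&e->seq, s + 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&e->fp, fp, memory_order_relaxed);
	atomic_store_explicit(&e->caps, caps, memory_order_relaxed);
	atomic_store_explicit(&e->seq, s + 2, memory_order_release);
}

/* One lockless attempt; returns false if the snapshot may be torn. */
static bool
entry_try_read(struct entry *e, uintptr_t *fp, uint64_t *caps)
{
	unsigned s0, s1;

	s0 = atomic_load_explicit(&e->seq, memory_order_acquire);
	if (s0 & 1)
		return (false);		/* writer is mid-update */
	*fp = atomic_load_explicit(&e->fp, memory_order_relaxed);
	*caps = atomic_load_explicit(&e->caps, memory_order_relaxed);
	atomic_thread_fence(memory_order_acquire);
	s1 = atomic_load_explicit(&e->seq, memory_order_relaxed);
	return (s0 == s1);
}

/* Retry until both fields come from the same update. */
static void
entry_read(struct entry *e, uintptr_t *fp, uint64_t *caps)
{
	while (!entry_try_read(e, fp, caps))
		continue;
}
```

A reader calls entry_read() and gets a pointer/capability pair from a
single update, or retries.  The race mentioned above (reallocation before
the refcount check) is outside what this sketch covers.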

-- 
Pawel Jakub Dawidek                       http://www.wheelsystems.com
FreeBSD committer                         http://www.FreeBSD.org
Am I Evil? Yes, I Am!                     http://mobter.com

From owner-freebsd-arch@FreeBSD.ORG  Wed Aug 20 20:23:06 2014
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTPS id DE8A5475;
 Wed, 20 Aug 2014 20:23:05 +0000 (UTC)
Received: from mail107.syd.optusnet.com.au (mail107.syd.optusnet.com.au
 [211.29.132.53])
 by mx1.freebsd.org (Postfix) with ESMTP id 893D432A9;
 Wed, 20 Aug 2014 20:23:05 +0000 (UTC)
Received: from c122-106-147-133.carlnfd1.nsw.optusnet.com.au
 (c122-106-147-133.carlnfd1.nsw.optusnet.com.au [122.106.147.133])
 by mail107.syd.optusnet.com.au (Postfix) with ESMTPS id 7BD19D448FA;
 Thu, 21 Aug 2014 06:22:56 +1000 (EST)
Date: Thu, 21 Aug 2014 06:22:55 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@besplex.bde.org
To: John Baldwin <jhb@freebsd.org>
Subject: Re: [PATCH 0/2] plug capability races
In-Reply-To: <201408201111.47601.jhb@freebsd.org>
Message-ID: <20140821044234.H11472@besplex.bde.org>
References: <1408064112-573-1-git-send-email-mjguzik@gmail.com>
 <201408151031.45967.jhb@freebsd.org> <20140816102840.V1007@besplex.bde.org>
 <201408201111.47601.jhb@freebsd.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
X-Optus-CM-Score: 0
X-Optus-CM-Analysis: v=2.1 cv=BdjhjNd2 c=1 sm=1 tr=0
 a=7NqvjVvQucbO2RlWB8PEog==:117 a=PO7r1zJSAAAA:8 a=tTSYktBZc9AA:10
 a=KN91Z2BipYgA:10 a=kj9zAlcOel0A:10 a=JzwRw_2MAAAA:8
 a=YwIKbEHEEUB-9GaOcawA:9 a=kzYn1Pzwvs4spdd-:21 a=Ip1ZeEM7m2elqLRx:21
 a=CjuIK1q_8ugA:10
Cc: Mateusz Guzik <mjguzik@gmail.com>, Robert Watson <rwatson@freebsd.org>,
 Johan Schuijt <johan@transip.nl>, freebsd-arch@freebsd.org,
 Konstantin Belousov <kib@freebsd.org>
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.18-1
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch/>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 20 Aug 2014 20:23:06 -0000

On Wed, 20 Aug 2014, John Baldwin wrote:

> On Friday, August 15, 2014 9:34:59 pm Bruce Evans wrote:
>> On Fri, 15 Aug 2014, John Baldwin wrote:
>>
>>> One thing I would like to see is for the timecounter code to be adapted to use
>>> the seq API instead of doing it by hand (the timecounter code is also missing
>>> barriers due to doing it by hand).
>>
>> Locking in the timecounter code is poor (1), but I fear a general mechanism
>> would be slower.  Also, the timecounter code now extends into userland,
>> so purely kernel locking cannot work for it.  The userland part is
>> more careful about locking than the kernel.  It has memory barriers and
>> other pessimizations which were intentionally left out of the kernel
>> locking for timecounters.  If these barriers are actually necessary, then
>> they give the silly situation that there are fewer races for userland
>> timecounting than kernel timecounting provided userland mostly does
>> direct accesses instead of syscalls and kernel uses of timecounters are
>> infrequent enough to not race often with the userland accesses.
>
> Yes, the userland code is more correct here.  The barriers are indeed missing in
> the kernel part, and adding them should give something equivalent to a correctly
> working seq API as it is doing the same thing.

Userland is technically correct, but this defeats the point of the
intended algorithm.

I now remember a bit more about the algorithm.  There are several
generations of timehands.  Each generation remains stable for several
clock ticks.  That should be several clock ticks at 100 Hz.  Normally
there is no problem with just using the old pointer read from timehands
(except there is no serialization for updating timehands itself (*)).
However, the thread might be preempted for several clock ticks.  This
is enough time for the old generation to change.  The generation count
is used to detect such changes.  Again it doesn't matter if the
generation count is out of date, unless it is out of date by a few
generations.  So the algorithm works unless the CPU de-serializes
things by more than a few clock ticks.  I think no real CPUs do that.
Virtual CPUs can do that, but I think they aren't a problem in practice.
Single stepping in ddb gives a sort of virtual CPU and breaks the
algorithm since time runs much faster outside of the stepped process
and may do several generations per step.  The generation count protects
against using a changed timehands but may cause binuptime() to never
terminate instead.  It takes much weirder virtualization than that to
break the generation count itself.  Any normal preemption or abnormal
stopping of CPUs uses locks galore which synchronize everything on at
least x86.
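
A rough userland sketch of the scheme just described, written with C11
atomics and with the barriers that the userland code has and the kernel
leaves out.  It is an illustration under assumptions, not kern_tc.c:
struct hands, ring, current, windup() and read_offset() are invented
names, a single writer is assumed, and one integer stands in for the
real timehands contents.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define	NHANDS	10			/* number of generations kept */

struct hands {
	atomic_uint gen;		/* 0 means "being rewritten" */
	_Atomic uint64_t offset;	/* stand-in for the real payload */
};

/* Slot 0 starts out valid, so readers work before the first windup. */
static struct hands ring[NHANDS] = { [0] = { .gen = 1 } };
static _Atomic(struct hands *) current = &ring[0];

/* Writer: rewrite the next slot, then make it the current one. */
static void
windup(uint64_t new_offset)
{
	struct hands *cur, *next;
	unsigned ogen;

	cur = atomic_load_explicit(&current, memory_order_relaxed);
	assert(cur >= ring && cur < ring + NHANDS);
	next = &ring[(size_t)(cur - ring + 1) % NHANDS];
	ogen = atomic_load_explicit(&next->gen, memory_order_relaxed);

	/* Mark the slot invalid before touching its contents. */
	atomic_store_explicit(&next->gen, 0, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&next->offset, new_offset,
	    memory_order_relaxed);

	/* New generation number, never zero. */
	if (++ogen == 0)
		ogen = 1;
	atomic_store_explicit(&next->gen, ogen, memory_order_release);

	/* Go live. */
	atomic_store_explicit(&current, next, memory_order_release);
}

/* Reader: retry if the slot was recycled while it was being read. */
static uint64_t
read_offset(void)
{
	struct hands *th;
	unsigned gen;
	uint64_t off;

	do {
		th = atomic_load_explicit(&current, memory_order_acquire);
		gen = atomic_load_explicit(&th->gen, memory_order_acquire);
		off = atomic_load_explicit(&th->offset,
		    memory_order_relaxed);
		atomic_thread_fence(memory_order_acquire);
	} while (gen == 0 ||
	    gen != atomic_load_explicit(&th->gen, memory_order_relaxed));
	return (off);
}
```

The acquire/release pairs here are exactly the "pessimizations" in
question; the argument above is that the kernel gets away without them
because a slot is not recycled for several clock ticks.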

Variable-tick kernels give another problem.  They sometimes issue virtual
clock interrupts to catch up.  I think they take some care with tc_windup()
but perhaps not enough.  tc_windup() calls must be separated so that the
timehands don't cycle too fast or too slow in either real time or time
related to other system operation (there are hard real time requirements
mainly for reading real hardware timecounters before they overflow).

(*):

% binuptime(struct bintime *bt)
% {
% 	struct timehands *th;
% 	u_int gen;
% 
% 	do {
% 		th = timehands;

Since tc_windup() also doesn't dream of memory ordering, timehands here
may be in the future of what it points to.  That is much worse than it
being in the past.  Barriers would be cheap in tc_windup() but useless
if they require barriers in binuptime() to work.

tc_windup() is normally called from the clock interrupt handler.  There
are several mutexes (or at least atomic ops that give synchronization on
at least x86 SMP) before and after it.  These give serialization very
soon after the changes.

The fix (without adding any barrier instructions) is easy.  Simply
run the timehands update 1 or 2 generations behind the update of what
it points to.  This gives even more than time-domain locking, since
the accidental synchronization from the interrupt handler gives ordering
between the update of the pointed-to data and the timehands pointer.

% 		gen = th->th_generation;

It doesn't matter if the generation count is in the future, but it needs
to be the same as what was written in the past or future.

% 		*bt = th->th_offset;
% 		bintime_addx(bt, th->th_scale * tc_delta(th));
% 	} while (gen == 0 || gen != th->th_generation);
% }

Now the timehands update code:

% 	/*
% 	 * Now that the struct timehands is again consistent, set the new
% 	 * generation number, making sure to not make it zero.
% 	 */

It is only sure to be consistent on in-order CPUs.

% 	if (++ogen == 0)
% 		ogen = 1;
% 	th->th_generation = ogen;
% 
% 	/* Go live with the new struct timehands. */
% #ifdef FFCLOCK
% 	switch (sysclock_active) {
% 	case SYSCLOCK_FBCK:
% #endif

I don't like the FFCLOCK complications.  They interact with the locking
bugs a little here.

% 		time_second = th->th_microtime.tv_sec;
% 		time_uptime = th->th_offset.sec;

Old versions had only these 2 statements before setting timehands and
returning.  These are racy enough.  Using these variables is racier.
They have type time_t, so they might be 64 bits on 32-bit arches so
reading them might be non-atomic.  In practice, very strong time-domain
locking applies -- the races won't occur until the top bits start being
actually used a mere 24 years from now.  Then there will be a race window
of a few microseconds.  The generation count should be used to make accesses
to these variables technically correct and slow.

% #ifdef FFCLOCK
% 		break;
% 	case SYSCLOCK_FFWD:
% 		time_second = fftimehands->tick_time_lerp.sec;
% 		time_uptime = fftimehands->tick_time_lerp.sec - ffclock_boottime.sec;
% 		break;

Perhaps more races from more complicated expressions.  Also a style bug
(long line).

% 	}
% #endif
% 
% 	timehands = th;
% 	timekeep_push_vdso();
% }

timekeep_push_vdso()  has a couple of atomic stores in it.  Perhaps
these give perfect serialization for the user variables.  On some
arches, they accidentally sync the kernel variables a little earlier
than the accidental sync from the interrupt handler.  Still out of order
with the kernel variable updates.  Again, this shouldn't be needed --
use a delayed pointer update for the user variables too.
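
The delayed pointer update suggested above might be staged roughly as
follows.  This is a single-threaded sketch that only shows the order of
operations; LAG, windup_delayed(), published and the other names are
hypothetical, and the cross-CPU ordering is assumed to come from the
synchronization that the interrupt handler does anyway between windups.

```c
#include <assert.h>
#include <stdint.h>

#define	NHANDS	10	/* number of generations kept */
#define	LAG	2	/* publish a slot this many windups after filling it */

static_assert(LAG + 1 < NHANDS, "ring too short for the chosen lag");

struct hands {
	unsigned gen;		/* generation number, never left at 0 */
	uint64_t offset;	/* stand-in for the real payload */
};

static struct hands ring[NHANDS];
static struct hands *published = &ring[0];	/* what readers would see */
static unsigned filled;		/* index of the most recently filled slot */
static unsigned nwindups;	/* number of windups so far */

static void
windup_delayed(uint64_t new_offset)
{
	unsigned next = (filled + 1) % NHANDS;

	/* Fill a slot other than the one `published` points at. */
	ring[next].offset = new_offset;
	if (++ring[next].gen == 0)
		ring[next].gen = 1;
	filled = next;
	nwindups++;

	/*
	 * Go live with the slot that was filled LAG windups ago.  By
	 * then the stores to it are at least LAG ticks old.
	 */
	if (nwindups > LAG)
		published = &ring[(filled + NHANDS - LAG) % NHANDS];
}
```

Readers would still check the generation count, since a reader preempted
for long enough can hold a pointer to a slot that is being refilled.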

Bruce

From owner-freebsd-arch@FreeBSD.ORG  Thu Aug 21 03:34:58 2014
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTPS id DE494AB2;
 Thu, 21 Aug 2014 03:34:58 +0000 (UTC)
Received: from mail108.syd.optusnet.com.au (mail108.syd.optusnet.com.au
 [211.29.132.59])
 by mx1.freebsd.org (Postfix) with ESMTP id 88D463C6F;
 Thu, 21 Aug 2014 03:34:58 +0000 (UTC)
Received: from c122-106-147-133.carlnfd1.nsw.optusnet.com.au
 (c122-106-147-133.carlnfd1.nsw.optusnet.com.au [122.106.147.133])
 by mail108.syd.optusnet.com.au (Postfix) with ESMTPS id AFCF61A2C2E;
 Thu, 21 Aug 2014 13:34:48 +1000 (EST)
Date: Thu, 21 Aug 2014 13:34:47 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@besplex.bde.org
To: Bruce Evans <brde@optusnet.com.au>
Subject: Re: [PATCH 0/2] plug capability races
In-Reply-To: <20140821044234.H11472@besplex.bde.org>
Message-ID: <20140821113753.D933@besplex.bde.org>
References: <1408064112-573-1-git-send-email-mjguzik@gmail.com>
 <201408151031.45967.jhb@freebsd.org> <20140816102840.V1007@besplex.bde.org>
 <201408201111.47601.jhb@freebsd.org> <20140821044234.H11472@besplex.bde.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
X-Optus-CM-Score: 0
X-Optus-CM-Analysis: v=2.1 cv=AOuw8Gd4 c=1 sm=1 tr=0
 a=7NqvjVvQucbO2RlWB8PEog==:117 a=PO7r1zJSAAAA:8 a=tTSYktBZc9AA:10
 a=KN91Z2BipYgA:10 a=kj9zAlcOel0A:10 a=JzwRw_2MAAAA:8
 a=rnMR6aIR_FJUPtQO_FsA:9 a=W3cemsHr8jZuBReB:21 a=LqShDMf_JJxsx4l9:21
 a=CjuIK1q_8ugA:10
Cc: Mateusz Guzik <mjguzik@gmail.com>, Robert Watson <rwatson@freebsd.org>,
 Johan Schuijt <johan@transip.nl>, freebsd-arch@freebsd.org,
 Konstantin Belousov <kib@freebsd.org>
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.18-1
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch/>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 21 Aug 2014 03:34:59 -0000

On Thu, 21 Aug 2014, Bruce Evans wrote:

> ...
> I now remember a bit more about the algorithm.  There are several
> generations of timehands.  Each generation remains stable for several
> clock ticks.  That should be several clock ticks at 100 Hz.  Normally
> there is no problem with just using the old pointer read from timehands
> (except there is no serialization for updating timehands itself (*)).
> ...
> (*):
>
> % binuptime(struct bintime *bt)
> % {
> % 	struct timehands *th;
> % 	u_int gen;
> % 
> % 	do {
> % 		th = timehands;
>
> Since tc_windup() also doesn't dream of memory ordering, timehands here
> may be in the future of what it points to.  That is much worse than it
> being in the past.  Barriers would be cheap in tc_windup() but useless
> if they require barriers in binuptime() to work.
>
> tc_windup() is normally called from the clock interrupt handler.  There
> are several mutexes (or at least atomic ops that give synchronization on
> at least x86 SMP) before and after it.  These give serialization very
> soon after the changes.
>
> The fix (without adding any barrier instructions) is easy.  Simply
> run the timehands update 1 or 2 generations behind the update of what
> it points to.  This gives even more than time-domain locking, since
> the accidental synchronization from the interrupt handler gives ordering
> between the update of the pointed-to data and the timehands pointer.
> ...

More details:
- lock tc_windup() and tc_ticktock() using a spinlock

- add hard real-time rate limiting and error recovery so that the timehands
   are not cycled through too fast or too slow.

     tc_ticktock() already does this for calls from the clock interrupt
     handler except when clock interrupts are non-hard.  tc_ticktock()
     can use mtx_trylock() and do nothing if the mutex is contested.

     tc_setclock() and possibly inittimecounter() should wait to synchronize
     with the next clock interrupt that would call tc_windup(), and advance
     the time that they set by the wait delay plus previous delays, and even
     more, since its changes shouldn't go live for several generations.  It
     sort of does this now, in a broken way.  It corrupts the boot time
     using racy accesses.  This limits problems from large adjustments to
     realtime clock ids (the ones that add the boot time).  There are no
     further delays, just races accessing the boot time in critical places
     like boottime().  Delays are now also limited by calling tc_windup()
     and tc_windup() going live with updated timehands almost immediately
     (as soon as it completes).  The immediate tc_windup() call is commented
     on as being to fiddle with all the crinkly bits around the fiords, but
     the only critical thing it does is update the generation count in
     a fairly non-racy way -- this tells bintime() to loop, so it has a
     chance of picking up the changed boot time with a coherent value.

     sysctl_kern_timecounter_hardware() should call tc_windup() to do
     a staged update much like for tc_setclock().  It refrains from
     doing this because of the races, but it hacks on the timehands
     pointer in a different and even more fragile racy way.  It now calls
     timekeep_push_vdso() to do the userland part of tc_windup().

     The timehands may be recycled too slowly.  This happens mainly on
     suspension.  The system depends on frequent windups to work, so
     it can't run really tickless.  After suspension, all old generations
     are garbage but their generation counts might not have been updated
     to indicate this.  The system should at least try to detect this.
     I don't understand what happens for timecounters on resume now.

- in tc_windup(), bump the generation count for the second-oldest generation
   instead of setting it to 0 for the current generation, and update the
   timehands for the oldest generation instead of changing them for the
   current generation.  This also fixes busy-waiting and contention on the
   timehands for the current generation during the windup.

     Using the special generation count of 0 essentially reduces the
     "queue" of timehands from length 10 to length 0 during the
     windup, at a cost of complications and bugs.

     It also makes the other 9 generations of the timehands almost
     never used, and not very useful.  1 generation together with a
     generation count that is set to 0 during windups suffices, at
     the cost of spinning while the generation count is 0 and
     complications and bugs in accesses to the generation count.

     But the current version already has all these costs in the usual
     case where the generation changes.  tc_windup() is supposed to
     run with interrupts disabled, so that it cannot be preempted
     and the length of the spinning is bounded.  (Having only Giant
     locking for the call in settime() is even worse than first appeared.
     It doesn't prevent preemption at all, so the length of the spinning
     is unbounded.)

     In unusual cases, binuptime() is preempted and the generation count
     changes many times before the original timehands is used.  Then
     the pointer to it is invalid.  But the generation count in it has
     increased by more than usual, so the change is detected and the
     pointer is updated.  So old generations are not used for storing
     anything important except for the generation count, and having 10
     generations just reduces the rate of increase of generation counts
     by a factor of 10, so it takes preemption by 10 * 2**32 windups
     instead of only 2**32 for the algorithm to be broken by wraparound
     of the generation count (with HZ = 1000, that is 490 days of
     preemption instead of only 49).

   The delayed updates might cause different complications.  I think
   ntp seconds updates strictly should be done in advance so as
   to go live on seconds rollover.  The details can't be too
   critical, since with HZ = 100 tc_windup calls are out of sync with
   seconds rollovers by an average of 5 milliseconds (+-5) and no one
   seemed to notice problems from that.  Isn't there an error of 1
   second for the duration of the sync time around leap seconds
   adjustments?  With HZ = 1000 the update "queue" with intentionally
   delayed updates could have length 5 and give much the same
   behaviour except for missing races (the average delay would still
   be 5 milliseconds but now +-0.5).
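
The wraparound figures above (490 days instead of 49) can be checked with
a few lines of C.  wrap_days() is a made-up helper; it assumes one windup
per clock tick and a 32-bit generation count in each slot.

```c
#include <assert.h>

/*
 * Days of continuous preemption needed before a 32-bit generation
 * count in one slot wraps all the way around, given `hz` windups per
 * second and `nhands` slots used in rotation.
 */
static double
wrap_days(unsigned hz, unsigned nhands)
{
	double windups;

	assert(hz > 0 && nhands > 0);
	/* Each slot's count advances once per nhands windups. */
	windups = 4294967296.0 * nhands;	/* 2**32 * nhands */
	return (windups / hz / 86400.0);
}
```

With HZ = 1000 this gives about 49.7 days for 1 generation and about 497
days for 10.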

Bruce