From owner-freebsd-arch@FreeBSD.ORG Sun Aug 17 01:26:58 2014 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id DEE43BE6; Sun, 17 Aug 2014 01:26:57 +0000 (UTC) Received: from mail-wg0-x22b.google.com (mail-wg0-x22b.google.com [IPv6:2a00:1450:400c:c00::22b]) (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 4917A23D8; Sun, 17 Aug 2014 01:26:57 +0000 (UTC) Received: by mail-wg0-f43.google.com with SMTP id l18so3612006wgh.14 for ; Sat, 16 Aug 2014 18:26:55 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=date:from:to:cc:subject:message-id:references:mime-version :content-type:content-disposition:content-transfer-encoding :in-reply-to:user-agent; bh=32AQpC+aVQbFOxyzwAsARKVHWmDo+UsYIP8IUKUFVX8=; b=bdhCwnryhRY/P5L1y/SFV53kx5K7U2XHkeEL/vtHaxFrFXMYSH6qFtx2EMYt6TKM+Z cv3+76otQKRB3BHB+kgdowzjVgGdTzCWtPOwz1bVWeP7N4iN3jbbl5lrkb/j+9D8jIb+ QHUcQy9DsNkpxhhnS5snoSqEqysgOh4vUYdfwqPU0lwXo9EvhzrTlpUfBH/JFCpcYDM3 svuCrBWHnx18jT7BSU4jmFK936V4Rj32QKLjwAQ6L3SmlZbPDq/ds2cHffqvipbnSrYE 6EdvQ0vF+7ayYecSrb5RraZafYa/oRjrfWcp9K+Miurwsew9ejE2ypeZvL+bTZu0WDU2 tXGw== X-Received: by 10.180.89.100 with SMTP id bn4mr18059090wib.34.1408238815528; Sat, 16 Aug 2014 18:26:55 -0700 (PDT) Received: from dft-labs.eu (n1x0n-1-pt.tunnel.tserv5.lon1.ipv6.he.net. [2001:470:1f08:1f7::2]) by mx.google.com with ESMTPSA id w1sm22141460wiz.14.2014.08.16.18.26.54 for (version=TLSv1.2 cipher=RC4-SHA bits=128/128); Sat, 16 Aug 2014 18:26:54 -0700 (PDT) Date: Sun, 17 Aug 2014 03:26:47 +0200 From: Mateusz Guzik To: Konstantin Belousov Subject: Re: [PATCH 1/2] Implement simple sequence counters with memory barriers. 
Message-ID: <20140817012646.GA21025@dft-labs.eu> References: <1408064112-573-1-git-send-email-mjguzik@gmail.com> <1408064112-573-2-git-send-email-mjguzik@gmail.com> <20140816093811.GX2737@kib.kiev.ua> <20140816185406.GD2737@kib.kiev.ua> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <20140816185406.GD2737@kib.kiev.ua> User-Agent: Mutt/1.5.21 (2010-09-15) Cc: Johan Schuijt , freebsd-arch@freebsd.org X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.18-1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 17 Aug 2014 01:26:58 -0000 On Sat, Aug 16, 2014 at 09:54:06PM +0300, Konstantin Belousov wrote: > On Sat, Aug 16, 2014 at 12:38:11PM +0300, Konstantin Belousov wrote: > > On Fri, Aug 15, 2014 at 02:55:11AM +0200, Mateusz Guzik wrote: > > > --- > > > sys/sys/seq.h | 126 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ > > > 1 file changed, 126 insertions(+) > > > create mode 100644 sys/sys/seq.h > > > > > > diff --git a/sys/sys/seq.h b/sys/sys/seq.h > > > new file mode 100644 > > > index 0000000..0971aef > > > --- /dev/null > > > +++ b/sys/sys/seq.h [..] > > > +#ifndef _SYS_SEQ_H_ > > > +#define _SYS_SEQ_H_ > > > + > > > +#ifdef _KERNEL > > > + > > > +/* > > > + * Typical usage: > > > + * > > > + * writers: > > > + * lock_exclusive(&obj->lock); > > > + * seq_write_begin(&obj->seq); > > > + * ..... 
> > > + *     seq_write_end(&obj->seq);
> > > + *     unlock_exclusive(&obj->unlock);
> > > + *
> > > + * readers:
> > > + *     obj_t lobj;
> > > + *     seq_t seq;
> > > + *
> > > + *     for (;;) {
> > > + *             seq = seq_read(&gobj->seq);
> > > + *             lobj = gobj;
> > > + *             if (seq_consistent(&gobj->seq, seq))
> > > + *                     break;
> > > + *             cpu_spinwait();
> > > + *     }
> > > + *     foo(lobj);
> > > + */
> > > +
> > > +typedef uint32_t seq_t;
> > > +
> > > +/* A hack to get MPASS macro */
> > > +#include
> > > +#include
> > > +
> > > +#include
> > > +
> > > +static __inline bool
> > > +seq_in_modify(seq_t seqp)
> > > +{
> > > +
> > > +        return (seqp & 1);
> > > +}
> > > +
> > > +static __inline void
> > > +seq_write_begin(seq_t *seqp)
> > > +{
> > > +
> > > +        MPASS(!seq_in_modify(*seqp));
> > > +        (*seqp)++;
> > > +        wmb();
> > This probably ought to be written as atomic_add_rel_int(seqp, 1);
> Alan Cox rightfully pointed out that better expression is
>         v = *seqp + 1;
>         atomic_store_rel_int(seqp, v);
> which also takes care of TSO on x86.
>

Well, my memory-barrier-and-so-on-fu is rather weak.

I had another look at the issue. At least on amd64, it looks like only a
compiler barrier is required for both reads and writes.

The AMD64 Architecture Programmer’s Manual, Volume 2: System
Programming, section 7.2 "Multiprocessor Memory Access Ordering", states:

"Loads do not pass previous loads (loads are not reordered). Stores do
not pass previous stores (stores are not reordered)"

Since the code modifying stuff only performs a series of writes and we
expect exclusive writers, I find it applicable to this scenario.

I checked the Linux sources and the generated assembly; they indeed issue
only a compiler barrier on amd64 (and for Intel processors as well).

atomic_store_rel_int on amd64 seems fine in this regard, but the only
function for loads issues lock cmpxchg, which kills performance
(median 55693659 -> 12789232 ops in a microbenchmark) for no gain.

Additionally, release and acquire semantics seem to be a stronger
guarantee than needed.

As far as sequence counters go, we should be able to get away with
guaranteeing the following:
- all relevant reads are performed between given points
- all relevant writes are performed between given points

As such, I propose introducing additional atomic_* function variants
(or stealing the smp_{w,r,}mb idea from Linux) which provide just that.

So for amd64 both the reading guarantee and the writing guarantee could
be provided in the same way, with a compiler barrier.

> > Same note for all other linux-style barriers. In fact, on x86
> > wmb() is sfence and it serves no useful purpose in seq_write*.
> >
> > Overall, it feels too alien and linux-ish for my taste.
> > Since we have sequence bound to some lock anyway, could we introduce
> > some sort of generation-aware locks variants, which extend existing
> > locks, and where lock/unlock bump generation number ?
> Still, merging it to the guts of lock implementation is right
> approach, IMO.
>

Current usage would be along with the filedesc (sx) lock. The lock
protects writes to the entire fd table (and lock holders can block in
malloc), while each file descriptor has its own counter. Also, areas
covered by seq are short and cannot block.

As such, I don't really see any way to merge the lock with the counter.

I agree it would be useful, provided the area protected by the lock is
the same as the one protected by the counter. If this code hits the tree
and one day it turns out someone needs such functionality, there should
not be any problems (apart from time and effort) in implementing it.
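
For illustration only, the reader/writer protocol discussed above can be sketched in user space with C11 atomics. This is not the code from the patch: struct obj, obj_write() and obj_read() are hypothetical names, writers are assumed to be serialized by some external lock, and C11 release/acquire fences stand in for whichever barrier primitives end up being used in the kernel. On amd64 these fences typically reduce to compiler barriers, which is the property argued for above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

typedef _Atomic uint32_t seq_t;

/* Hypothetical object whose fields are guarded by a sequence counter. */
struct obj {
	seq_t seq;
	_Atomic uint32_t a;
	_Atomic uint32_t b;
};

/* An odd counter value means a writer is in the middle of an update. */
static inline bool
seq_in_modify(uint32_t seq)
{
	return (seq & 1);
}

static inline void
seq_write_begin(seq_t *seqp)
{
	uint32_t seq = atomic_load_explicit(seqp, memory_order_relaxed);

	assert(!seq_in_modify(seq));
	atomic_store_explicit(seqp, seq + 1, memory_order_relaxed);
	/* Keep the counter bump ordered before the data stores that follow. */
	atomic_thread_fence(memory_order_release);
}

static inline void
seq_write_end(seq_t *seqp)
{
	uint32_t seq = atomic_load_explicit(seqp, memory_order_relaxed);

	assert(seq_in_modify(seq));
	/* Release store: earlier data stores are visible before the even value. */
	atomic_store_explicit(seqp, seq + 1, memory_order_release);
}

/* Spin until no writer is active and return the counter snapshot. */
static inline uint32_t
seq_read(seq_t *seqp)
{
	uint32_t seq;

	for (;;) {
		seq = atomic_load_explicit(seqp, memory_order_acquire);
		if (!seq_in_modify(seq))
			return (seq);
	}
}

static inline bool
seq_consistent(seq_t *seqp, uint32_t oldseq)
{
	/* Keep the preceding data loads ordered before the counter re-read. */
	atomic_thread_fence(memory_order_acquire);
	return (atomic_load_explicit(seqp, memory_order_relaxed) == oldseq);
}

/* Writer side; the caller is assumed to hold the exclusive lock. */
static void
obj_write(struct obj *o, uint32_t a, uint32_t b)
{
	seq_write_begin(&o->seq);
	atomic_store_explicit(&o->a, a, memory_order_relaxed);
	atomic_store_explicit(&o->b, b, memory_order_relaxed);
	seq_write_end(&o->seq);
}

/* Reader side: retry until a snapshot was taken with no writer in between. */
static void
obj_read(struct obj *o, uint32_t *a, uint32_t *b)
{
	uint32_t seq;

	for (;;) {
		seq = seq_read(&o->seq);
		*a = atomic_load_explicit(&o->a, memory_order_relaxed);
		*b = atomic_load_explicit(&o->b, memory_order_relaxed);
		if (seq_consistent(&o->seq, seq))
			break;
	}
}
```

The data fields are declared atomic and accessed with relaxed ordering only so that the sketch is free of data races under the C11 model; the ordering itself comes from the two fences and the release store.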
-- Mateusz Guzik From owner-freebsd-arch@FreeBSD.ORG Sun Aug 17 10:44:34 2014 Return-Path: Delivered-To: freebsd-arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 57FF3513; Sun, 17 Aug 2014 10:44:34 +0000 (UTC) Received: from mail.beastielabs.net (unknown [IPv6:2001:888:1227:0:200:24ff:fec9:5934]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id B0C0D25AB; Sun, 17 Aug 2014 10:44:33 +0000 (UTC) Received: from beastie.hotsoft.nl (beastie.hotsoft.nl [IPv6:2001:888:1227:0:219:d1ff:fee8:91eb]) by mail.beastielabs.net (8.14.7/8.14.7) with ESMTP id s7HAiUnN059437; Sun, 17 Aug 2014 12:44:30 +0200 (CEST) (envelope-from hans@beastielabs.net) Message-ID: <53F0878E.3000401@beastielabs.net> Date: Sun, 17 Aug 2014 12:44:30 +0200 From: Hans Ottevanger User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:31.0) Gecko/20100101 Thunderbird/31.0 MIME-Version: 1.0 To: =?UTF-8?B?RWR3YXJkIFRvbWFzeiBOYXBpZXJhxYJh?= Subject: Re: [CFT] Autofs. References: <20140730071933.GA20122@pc5.home> In-Reply-To: <20140730071933.GA20122@pc5.home> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit Cc: freebsd-current@FreeBSD.org, freebsd-arch@FreeBSD.org X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.18-1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 17 Aug 2014 10:44:34 -0000 On 07/30/14 09:19, Edward Tomasz Napierała wrote: > At the link below you will find a patch that adds the new automounter. > The patch is against yesterdays 11.0-CURRENT. 
> > http://people.freebsd.org/~trasz/autofs-head-20140729.diff > > Slides that explain the project scope and deliverables are here: > > http://people.freebsd.org/~trasz/autofs.pdf > > Testing is welcome. Please start with manual pages, eg. automount(8). > Note that you need not only to rebuild both kernel and world, but also > to run mergemaster, to install required /etc files. To run at startup, > add 'autofs_enable="YES"' to /etc/rc.conf. > > This project is being sponsored by FreeBSD Foundation. > Hi! Great to see a real autofs finally coming to FreeBSD. I already did some very cursory testing on a recent 11-CURRENT system that I still happened to have and things with at least the /net map look quite OK. I could do some more extensive testing if I could use some of my 10-STABLE systems. I already checked that the patch applies cleanly to a recent 10-STABLE (modulo a few offsets) and that both buildworld and buildkernel succeed. Should I expect difficulties actually running your autofs on 10-STABLE? And do you plan support for NIS? I know NIS is quite dead and has been so for at least 20 years, but I still see it being used occasionally (probably most out of habit) and it is (still ?) available in the base-system. 
Kind regards, Hans From owner-freebsd-arch@FreeBSD.ORG Sun Aug 17 13:22:58 2014 Return-Path: Delivered-To: freebsd-arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 16F24A7D; Sun, 17 Aug 2014 13:22:58 +0000 (UTC) Received: from outpost1.zedat.fu-berlin.de (outpost1.zedat.fu-berlin.de [130.133.4.66]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id A5E4327FF; Sun, 17 Aug 2014 13:22:57 +0000 (UTC) Received: from inpost2.zedat.fu-berlin.de ([130.133.4.69]) by outpost.zedat.fu-berlin.de (Exim 4.82) with esmtp (envelope-from ) id <1XJ0QB-000LJR-7e>; Sun, 17 Aug 2014 15:22:55 +0200 Received: from g229053128.adsl.alicedsl.de ([92.229.53.128] helo=thor.walstatt.dynvpn.de) by inpost2.zedat.fu-berlin.de (Exim 4.82) with esmtpsa (envelope-from ) id <1XJ0QB-002tJx-3u>; Sun, 17 Aug 2014 15:22:55 +0200 Date: Sun, 17 Aug 2014 15:22:54 +0200 From: "O. Hartmann" To: Hans Ottevanger Subject: Re: [CFT] Autofs. 
Message-ID: <20140817152254.1e2786db.ohartman@zedat.fu-berlin.de> In-Reply-To: <53F0878E.3000401@beastielabs.net> References: <20140730071933.GA20122@pc5.home> <53F0878E.3000401@beastielabs.net> Organization: FU Berlin X-Mailer: Claws Mail 3.10.1 (GTK+ 2.24.22; amd64-portbld-freebsd11.0) MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; boundary="Sig_//tezXkiRzb233=u3GdmP179"; protocol="application/pgp-signature" X-Originating-IP: 92.229.53.128 X-ZEDAT-Hint: A Cc: freebsd-current@FreeBSD.org, Edward Tomasz Napierała , freebsd-arch@FreeBSD.org X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.18-1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 17 Aug 2014 13:22:58 -0000

--Sig_//tezXkiRzb233=u3GdmP179
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable

On Sun, 17 Aug 2014 12:44:30 +0200, Hans Ottevanger wrote:

> On 07/30/14 09:19, Edward Tomasz Napierała wrote:
> > At the link below you will find a patch that adds the new automounter.
> > The patch is against yesterdays 11.0-CURRENT.
> >
> > http://people.freebsd.org/~trasz/autofs-head-20140729.diff
> >
> > Slides that explain the project scope and deliverables are here:
> >
> > http://people.freebsd.org/~trasz/autofs.pdf
> >
> > Testing is welcome. Please start with manual pages, eg. automount(8).
> > Note that you need not only to rebuild both kernel and world, but also
> > to run mergemaster, to install required /etc files. To run at startup,
> > add 'autofs_enable="YES"' to /etc/rc.conf.
> >
> > This project is being sponsored by FreeBSD Foundation.
> >
>
> Hi!
>
> Great to see a real autofs finally coming to FreeBSD.
>
> I already did some very cursory testing on a recent 11-CURRENT system
> that I still happened to have and things with at least the /net map look
> quite OK.
>
> I could do some more extensive testing if I could use some of my
> 10-STABLE systems. I already checked that the patch applies cleanly to a
> recent 10-STABLE (modulo a few offsets) and that both buildworld and
> buildkernel succeed. Should I expect difficulties actually running your
> autofs on 10-STABLE?
>
> And do you plan support for NIS? I know NIS is quite dead and has been
> so for at least 20 years, but I still see it being used occasionally
> (probably most out of habit) and it is (still ?) available in the
> base-system.
>
> Kind regards,
>
> Hans
>
> _______________________________________________
> freebsd-current@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-current
> To unsubscribe, send any mail to "freebsd-current-unsubscribe@freebsd.org"

Is this "new" autofs of the same type and concept as the autofs used in Linux for more
than a decade now?

--Sig_//tezXkiRzb233=u3GdmP179
Content-Type: application/pgp-signature; name=signature.asc
Content-Disposition: attachment; filename=signature.asc

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2

iQEcBAEBAgAGBQJT8KyuAAoJEOgBcD7A/5N8wUcIALO/3aHJq2q2udeRrHvvX552
0LTB1pRdaNzFYWP8obX6D0eMmpc6qkBAYQ3FjVWfDI3bBctMJQOM3949jIpBJ6ET
0UGyDsdx0wCkxDL69vf7AJ1G4ECZuckpgIzhczXMrUaz7oEPL8cSoJdtYhbARayU
Mv7/YqFvoYvBuWI80g3dLmXTxOKXTZcC9SWPeJNC/njrJOtCxn8cevz6gMBp3fLS
/uqt3jLXYbkK+cDxhE5Rm7CNdjdkJfsFbX1a/4mUXM+3yX0onMeL5fVahEtyiye/
d4RokjF2VVNgUyMt4RyRshLKI48O7JfQ57AK+IO0xM+HAg/s1vFybzSTjVVhxVU=
=cZaN
-----END PGP SIGNATURE-----

--Sig_//tezXkiRzb233=u3GdmP179--

From owner-freebsd-arch@FreeBSD.ORG Sun Aug 17 14:51:02 2014 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 862F73ED; Sun, 17 Aug 2014 14:51:02 +0000 (UTC) Received: from mail-la0-x235.google.com
(mail-la0-x235.google.com [IPv6:2a00:1450:4010:c03::235]) (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id D32C2205D; Sun, 17 Aug 2014 14:51:01 +0000 (UTC) Received: by mail-la0-f53.google.com with SMTP id gl10so3778899lab.12 for ; Sun, 17 Aug 2014 07:50:59 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=sender:date:from:to:cc:subject:message-id:mail-followup-to :references:mime-version:content-type:content-disposition :content-transfer-encoding:in-reply-to:user-agent; bh=MuyiV+EyhnLHm8Dvo+77CunOnSeiifyyhLrf7a/5Jgg=; b=ttR4imc4RpEzf6NY6OlPFFeVXIIGB+WEa7zyF/ctmIMEVRfj/HXOCozQdP23cjEDbj t6RQ3YFIybUV+BW72V2eLXrm9IlOM6FCDkRZEPZE3WMIQ3tXy5XaqwlHnxlBJNmf2j2Z G36qTMTyD2Hk552cReuFhbjxhtqdNyRmQVfi7NuJ61u8moaf/wYJ5/19XA9ALrb1fjwv FZpIwJdHhwRMfgc84PjNEddAQruJneeVHCm9VV79BNJBoRfNysPQNq3+WmEWGVEceJk6 3EI/LfS1ClXt/ZY6XcqkySEZZBe6BaN+aICrffrF68zw3paqhbTo4rbn7rCnFluHjnUx jMvw== X-Received: by 10.152.164.70 with SMTP id yo6mr23792407lab.2.1408287059767; Sun, 17 Aug 2014 07:50:59 -0700 (PDT) Received: from pc5.home (abpi45.neoplus.adsl.tpnet.pl. [83.8.50.45]) by mx.google.com with ESMTPSA id h3sm8741756lah.20.2014.08.17.07.50.58 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Sun, 17 Aug 2014 07:50:59 -0700 (PDT) Sender: =?UTF-8?Q?Edward_Tomasz_Napiera=C5=82a?= Date: Sun, 17 Aug 2014 16:50:59 +0200 From: Edward Tomasz =?utf-8?Q?Napiera=C5=82a?= To: Hans Ottevanger Subject: Re: [CFT] Autofs. 
Message-ID: <20140817145059.GA5497@pc5.home> Mail-Followup-To: Hans Ottevanger , freebsd-arch@FreeBSD.org, freebsd-current@FreeBSD.org References: <20140730071933.GA20122@pc5.home> <53F0878E.3000401@beastielabs.net> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <53F0878E.3000401@beastielabs.net> User-Agent: Mutt/1.5.21 (2010-09-15) Cc: freebsd-current@FreeBSD.org, freebsd-arch@FreeBSD.org X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.18-1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 17 Aug 2014 14:51:02 -0000 On 0817T1244, Hans Ottevanger wrote: > On 07/30/14 09:19, Edward Tomasz Napierała wrote: > >At the link below you will find a patch that adds the new automounter. > >The patch is against yesterdays 11.0-CURRENT. > > > >http://people.freebsd.org/~trasz/autofs-head-20140729.diff > > > >Slides that explain the project scope and deliverables are here: > > > >http://people.freebsd.org/~trasz/autofs.pdf > > > >Testing is welcome. Please start with manual pages, eg. automount(8). > >Note that you need not only to rebuild both kernel and world, but also > >to run mergemaster, to install required /etc files. To run at startup, > >add 'autofs_enable="YES"' to /etc/rc.conf. > > > >This project is being sponsored by FreeBSD Foundation. > > > > Hi! > > Great to see a real autofs finally coming to FreeBSD. > > I already did some very cursory testing on a recent 11-CURRENT system > that I still happened to have and things with at least the /net map > look quite OK. > > I could do some more extensive testing if I could use some of my > 10-STABLE systems. I already checked that the patch applies cleanly > to a recent 10-STABLE (modulo a few offsets) and that both buildworld > and buildkernel succeed. 
Should I expect difficulties actually > running your autofs on 10-STABLE? No, it should be fine. Plan is to MFC this to 10 soon, btw. > And do you plan support for NIS? I know NIS is quite dead and has > been so for at least 20 years, but I still see it being used > occasionally (probably most out of habit) and it is (still ?) > available in the base-system. It should be trivial to add, I just need someone with such setup (autofs maps in NIS) to test it against. From owner-freebsd-arch@FreeBSD.ORG Sun Aug 17 14:52:20 2014 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id D917C66B; Sun, 17 Aug 2014 14:52:20 +0000 (UTC) Received: from mail-lb0-x229.google.com (mail-lb0-x229.google.com [IPv6:2a00:1450:4010:c04::229]) (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 2097B20E6; Sun, 17 Aug 2014 14:52:19 +0000 (UTC) Received: by mail-lb0-f169.google.com with SMTP id s7so3359793lbd.0 for ; Sun, 17 Aug 2014 07:52:17 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=sender:date:from:to:cc:subject:message-id:mail-followup-to :references:mime-version:content-type:content-disposition :content-transfer-encoding:in-reply-to:user-agent; bh=KgsSmmtsQW129B+GOjTcU5ln0byYKGpxFwQg3XansU8=; b=ZK77mR/i3rd1yiATY76LdZ2Cm2ZOlcmSGyic1KcsWO5AAIQNKLYs+e2xk1cB3zreIi 3ClAMLuC2nZi9AaZbLsJyakZ0vP6mxy51d9ZQqL6m2qFPCqhLV8WRW5Z5JYYtUmssKBI Bco0GcuMpJF6CxFnOspHlj/WffWNQxdYZohTZcngQLIt/RT0lydaCWEVbd41Pv96Iy02 P/8z7Ov1BFs7LiA62KRx2FdqJ7hZxu936u9M/lCaioSPRYDPwSAuCcf7EjUIcrkr5MrF mckGsvQcGl9OTebNgUnu2dPuPKHfPmzesr1C2tDUl7yzoE77QHWgzasjVpkfYWwmbPGg 3h8Q== X-Received: by 10.112.52.225 with SMTP id w1mr23001264lbo.44.1408287137869; Sun, 
17 Aug 2014 07:52:17 -0700 (PDT) Received: from pc5.home (abpi45.neoplus.adsl.tpnet.pl. [83.8.50.45]) by mx.google.com with ESMTPSA id yn1sm22592200lbb.25.2014.08.17.07.52.16 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Sun, 17 Aug 2014 07:52:17 -0700 (PDT) Sender: =?UTF-8?Q?Edward_Tomasz_Napiera=C5=82a?= Date: Sun, 17 Aug 2014 16:52:17 +0200 From: Edward Tomasz =?utf-8?Q?Napiera=C5=82a?= To: "O. Hartmann" Subject: Re: [CFT] Autofs. Message-ID: <20140817145217.GB5497@pc5.home> Mail-Followup-To: "O. Hartmann" , Hans Ottevanger , freebsd-current@FreeBSD.org, freebsd-arch@FreeBSD.org References: <20140730071933.GA20122@pc5.home> <53F0878E.3000401@beastielabs.net> <20140817152254.1e2786db.ohartman@zedat.fu-berlin.de> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <20140817152254.1e2786db.ohartman@zedat.fu-berlin.de> User-Agent: Mutt/1.5.21 (2010-09-15) Cc: freebsd-current@FreeBSD.org, Hans Ottevanger , freebsd-arch@FreeBSD.org X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.18-1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 17 Aug 2014 14:52:21 -0000 On 0817T1522, O. Hartmann wrote: > Am Sun, 17 Aug 2014 12:44:30 +0200 > Hans Ottevanger schrieb: > > > On 07/30/14 09:19, Edward Tomasz Napierała wrote: > > > At the link below you will find a patch that adds the new automounter. > > > The patch is against yesterdays 11.0-CURRENT. > > > > > > http://people.freebsd.org/~trasz/autofs-head-20140729.diff > > > > > > Slides that explain the project scope and deliverables are here: > > > > > > http://people.freebsd.org/~trasz/autofs.pdf > > > > > > Testing is welcome. Please start with manual pages, eg. automount(8). 
> > > Note that you need not only to rebuild both kernel and world, but also > > > to run mergemaster, to install required /etc files. To run at startup, > > > add 'autofs_enable="YES"' to /etc/rc.conf. > > > > > > This project is being sponsored by FreeBSD Foundation. > > > > > > > Hi! > > > > Great to see a real autofs finally coming to FreeBSD. > > > > I already did some very cursory testing on a recent 11-CURRENT system > > that I still happened to have and things with at least the /net map look > > quite OK. > > > > I could do some more extensive testing if I could use some of my > > 10-STABLE systems. I already checked that the patch applies cleanly to a > > recent 10-STABLE (modulo a few offsets) and that both buildworld and > > buildkernel succeed. Should I expect difficulties actually running your > > autofs on 10-STABLE? > > > > And do you plan support for NIS? I know NIS is quite dead and has been > > so for at least 20 years, but I still see it being used occasionally > > (probably most out of habit) and it is (still ?) available in the > > base-system. > > > > Kind regards, > > > > Hans > > > > _______________________________________________ > > freebsd-current@freebsd.org mailing list > > http://lists.freebsd.org/mailman/listinfo/freebsd-current > > To unsubscribe, send any mail to "freebsd-current-unsubscribe@freebsd.org" > > Is this "new" autofs of the same type and concept as the autofs used in Linux for more > than a decade now? Yes. 
From owner-freebsd-arch@FreeBSD.ORG Mon Aug 18 06:54:25 2014 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id A5078738; Mon, 18 Aug 2014 06:54:25 +0000 (UTC) Received: from mailout05.t-online.de (mailout05.t-online.de [194.25.134.82]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client CN "mailout00.t-online.de", Issuer "TeleSec ServerPass DE-1" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 39AFF3276; Mon, 18 Aug 2014 06:54:25 +0000 (UTC) Received: from fwd33.aul.t-online.de (fwd33.aul.t-online.de [172.20.27.144]) by mailout05.t-online.de (Postfix) with SMTP id C59C046F5AD; Mon, 18 Aug 2014 08:54:16 +0200 (CEST) Received: from [192.168.119.33] (XRes1UZOYhfIVQzekWzmQNycC9jB7ZSj-I3wVmnK-xvkQbjr-ke7bYQq8f8AFS0gWg@[84.154.101.219]) by fwd33.t-online.de with (TLSv1.2:ECDHE-RSA-AES256-SHA encrypted) esmtp id 1XJGpY-1anVSa0; Mon, 18 Aug 2014 08:54:12 +0200 Message-ID: <53F1A311.4080707@freebsd.org> Date: Mon, 18 Aug 2014 08:54:09 +0200 From: Stefan Esser User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Thunderbird/31.0 MIME-Version: 1.0 To: Phil Shafer , Alfred Perlstein Subject: Re: XML Output: libxo - provide single API to output TXT, XML, JSON and HTML References: <201408151613.s7FGDMmt003567@idle.juniper.net> In-Reply-To: <201408151613.s7FGDMmt003567@idle.juniper.net> Content-Type: text/plain; charset=windows-1252 Content-Transfer-Encoding: 7bit X-ID: XRes1UZOYhfIVQzekWzmQNycC9jB7ZSj-I3wVmnK-xvkQbjr-ke7bYQq8f8AFS0gWg X-TOI-MSGID: 3b53171b-f196-43fa-9020-dc778cab534f Cc: Marcel Moolenaar , John-Mark Gurney , "Simon J. 
Gerraty" , "arch@freebsd.org" , Poul-Henning Kamp , Konstantin Belousov , Marcel Moolenaar X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.18-1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 18 Aug 2014 06:54:25 -0000

On 15.08.2014 at 18:13, Phil Shafer wrote:
> Alfred Perlstein writes:
>> Can someone explain an actual use case here that makes sense?
>
> In JUNOS, we support a NETCONF API, allowing NETCONF RPCs (in XML)
> to get hierarchical data back (in XML). We use this to automate
> management of our devices. When we parse RPCs, we construct command
> lines that are invoked.
>
> For example the "show interfaces terse" command in the CLI is
> available as the RPC with the
> option. The JUNOS CLI parses either of these into the command line
> "ifinfo -b".
>
> We currently are told which commands support XML output and which
> don't. For those that do, we simply forward the command's output
> to the client. For those that don't we wrap the output in an XML
> tag that means "we don't support this in XML yet, but here's the
> text" (and escape the data).

Would it be possible to introduce an "xo" command that takes a command
line as its argument (in the same way as, e.g., "time")?

A sample usage could be "xo ls -s", which should invoke "ls -l" with
its output converted to XML (and "xo -json ls -l" could produce JSON
output).

This command is meant to decouple the request for XO support from the
method that checks for XO support and enables it.

If "xo" determines that "ls" cannot produce structured output, it
executes it as a sub-command and wraps the output in the way you
describe.

This may not be parseable by a following command in a pipe, and you
could add an "-f" option to "xo" that checks for XO support and makes
the command fail if it is not supported (instead of wrapping up the
result).
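
As a rough illustration of the fallback path described above (wrapping the plain-text output of a command that has no structured output in a single escaped XML element), a minimal C sketch follows. Everything in it is an assumption made for the example: the element name "unstructured-output" and the helpers append() and wrap_plain_output() are hypothetical, and none of this is libxo or JUNOS API.

```c
#include <stddef.h>
#include <string.h>

/*
 * Append src to the NUL-terminated string in dst (capacity dstlen).
 * Returns 0 on success, -1 if the result would not fit.
 */
static int
append(char *dst, size_t dstlen, const char *src)
{
	size_t used = strlen(dst);
	size_t n = strlen(src);

	if (n >= dstlen - used)
		return (-1);
	memcpy(dst + used, src, n + 1);
	return (0);
}

/*
 * Wrap plain-text command output in a placeholder element and escape the
 * characters that are special in XML character data.  The element name is
 * made up for this example.
 */
static int
wrap_plain_output(const char *text, char *out, size_t outlen)
{
	char one[2] = { '\0', '\0' };
	const char *rep;

	if (outlen == 0)
		return (-1);
	out[0] = '\0';
	if (append(out, outlen, "<unstructured-output>") != 0)
		return (-1);
	for (; *text != '\0'; text++) {
		switch (*text) {
		case '&':
			rep = "&amp;";
			break;
		case '<':
			rep = "&lt;";
			break;
		case '>':
			rep = "&gt;";
			break;
		default:
			/* Ordinary character: copy it through unchanged. */
			one[0] = *text;
			rep = one;
			break;
		}
		if (append(out, outlen, rep) != 0)
			return (-1);
	}
	return (append(out, outlen, "</unstructured-output>"));
}
```

A wrapper along the lines of the proposed "xo" command would run the child process, collect its output, and pass it through a function like this only when the child cannot emit structured output itself; with an "-f"-style option it would return an error instead of calling it.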
The downside is the extra process invocation required for "xo", but you could use any of the suggested methods to check for and enable support of XO in programs, and you could change that method at a later time without breaking existing scripts. Methods discussed so far are e.g.: - add long option as ARGV[1] (e.g. "--libxo-is-supported") - use command name prefix ("xo-$CMD" linked to the actual $CMD) - test for and use different standard file descriptors (XO_STDIN, XO_STDOUT, and XO_STDERR) if supported by the command (I have probably forgotten a few ...) If you go for "xo [options] cmd", any of the above mechanisms can be used and the actual method can be changed at a later time. And further options (e.g. to control the output format - XML vs. JSON, for example) could also be passed by any method (e.g via an environment variable checked by libxo). Anyway: While the command syntax is not important, it should be stable. And that's what an "xo" command could provide ... Regards, STefan From owner-freebsd-arch@FreeBSD.ORG Mon Aug 18 08:26:53 2014 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id D173DC60; Mon, 18 Aug 2014 08:26:53 +0000 (UTC) Received: from kib.kiev.ua (kib.kiev.ua [IPv6:2001:470:d5e7:1::1]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 565513A46; Mon, 18 Aug 2014 08:26:53 +0000 (UTC) Received: from tom.home (kib@localhost [127.0.0.1]) by kib.kiev.ua (8.14.9/8.14.9) with ESMTP id s7I8QkJr080417 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Mon, 18 Aug 2014 11:26:46 +0300 (EEST) (envelope-from kostikbel@gmail.com) DKIM-Filter: OpenDKIM Filter v2.9.2 kib.kiev.ua s7I8QkJr080417 Received: (from kostik@localhost) by tom.home 
(8.14.9/8.14.9/Submit) id s7I8QkSH080416; Mon, 18 Aug 2014 11:26:46 +0300 (EEST) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: tom.home: kostik set sender to kostikbel@gmail.com using -f Date: Mon, 18 Aug 2014 11:26:46 +0300 From: Konstantin Belousov To: Mateusz Guzik Subject: Re: [PATCH 1/2] Implement simple sequence counters with memory barriers. Message-ID: <20140818082646.GL2737@kib.kiev.ua> References: <1408064112-573-1-git-send-email-mjguzik@gmail.com> <1408064112-573-2-git-send-email-mjguzik@gmail.com> <20140816093811.GX2737@kib.kiev.ua> <20140816185406.GD2737@kib.kiev.ua> <20140817012646.GA21025@dft-labs.eu> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="HlXFiQcSFG/a+HqU" Content-Disposition: inline In-Reply-To: <20140817012646.GA21025@dft-labs.eu> User-Agent: Mutt/1.5.23 (2014-03-12) X-Spam-Status: No, score=-2.0 required=5.0 tests=ALL_TRUSTED,BAYES_00, DKIM_ADSP_CUSTOM_MED,FREEMAIL_FROM,NML_ADSP_CUSTOM_MED autolearn=no autolearn_force=no version=3.4.0 X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on tom.home Cc: Johan Schuijt , freebsd-arch@freebsd.org X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.18-1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 18 Aug 2014 08:26:53 -0000 --HlXFiQcSFG/a+HqU Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Sun, Aug 17, 2014 at 03:26:47AM +0200, Mateusz Guzik wrote: > On Sat, Aug 16, 2014 at 09:54:06PM +0300, Konstantin Belousov wrote: > > On Sat, Aug 16, 2014 at 12:38:11PM +0300, Konstantin Belousov wrote: > > > On Fri, Aug 15, 2014 at 02:55:11AM +0200, Mateusz Guzik wrote: > > > > --- > > > > sys/sys/seq.h | 126 ++++++++++++++++++++++++++++++++++++++++++++++= ++++++++++++ > > > > 1 file changed, 126 insertions(+) > > > > 
create mode 100644 sys/sys/seq.h > > > >=20 > > > > diff --git a/sys/sys/seq.h b/sys/sys/seq.h > > > > new file mode 100644 > > > > index 0000000..0971aef > > > > --- /dev/null > > > > +++ b/sys/sys/seq.h > [..] > > > > +#ifndef _SYS_SEQ_H_ > > > > +#define _SYS_SEQ_H_ > > > > + > > > > +#ifdef _KERNEL > > > > + > > > > +/* > > > > + * Typical usage: > > > > + * > > > > + * writers: > > > > + * lock_exclusive(&obj->lock); > > > > + * seq_write_begin(&obj->seq); > > > > + * ..... > > > > + * seq_write_end(&obj->seq); > > > > + * unlock_exclusive(&obj->unlock); > > > > + * > > > > + * readers: > > > > + * obj_t lobj; > > > > + * seq_t seq; > > > > + * > > > > + * for (;;) { > > > > + * seq =3D seq_read(&gobj->seq); > > > > + * lobj =3D gobj; > > > > + * if (seq_consistent(&gobj->seq, seq)) > > > > + * break; > > > > + * cpu_spinwait(); > > > > + * } > > > > + * foo(lobj); > > > > + */ =09 > > > > + > > > > +typedef uint32_t seq_t; > > > > + > > > > +/* A hack to get MPASS macro */ > > > > +#include > > > > +#include > > > > + > > > > +#include > > > > + > > > > +static __inline bool > > > > +seq_in_modify(seq_t seqp) > > > > +{ > > > > + > > > > + return (seqp & 1); > > > > +} > > > > + > > > > +static __inline void > > > > +seq_write_begin(seq_t *seqp) > > > > +{ > > > > + > > > > + MPASS(!seq_in_modify(*seqp)); > > > > + (*seqp)++; > > > > + wmb(); > > > This probably ought to be written as atomic_add_rel_int(seqp, 1); > > Alan Cox rightfully pointed out that better expression is > > v =3D *seqp + 1; = =20 > > atomic_store_rel_int(seqp, v); > > which also takes care of TSO on x86. > >=20 >=20 > Well, my memory-barrier-and-so-on-fu is rather weak. >=20 > I had another look at the issue. At least on amd64, it looks like only > compiler barrier is required for both reads and writes. 
>
> According to AMD64 Architecture Programmer's Manual Volume 2: System
> Programming, 7.2 Multiprocessor Memory Access Ordering states:
>
> "Loads do not pass previous loads (loads are not reordered). Stores do
> not pass previous stores (stores are not reordered)"
>
> Since the code modifying stuff only performs a series of writes and we
> expect exclusive writers, I find it applicable to this scenario.
I agree.

>
> I checked linux sources and generated assembly, they indeed issue only
> a compiler barrier on amd64 (and for intel processors as well).
>
> atomic_store_rel_int on amd64 seems fine in this regard, but the only
> function for loads issues lock cmpxchg which kills performance
> (median 55693659 -> 12789232 ops in a microbenchmark) for no gain.
>
> Additionally release and acquire semantics seems to be a stronger than
> needed guarantee.
>
> As far as sequence counters go, we should be able to get away with
> making the following:
> - all relevant reads are performed between given points
> - all relevant writes are performed between given points
>
> As such, I propose introducing another atomic_* function variants
> (or stealing smp_{w,r,}mb idea from linux) which provide just that.
>
> So for amd64 reading guarantee and writing guarantee could be provided
> in the same way with a compiler barrier.
I think even this could be nicely done in the ia64 style of acq/rel.

>
> > > Same note for all other linux-style barriers. In fact, on x86
> > > wmb() is sfence and it serves no useful purpose in seq_write*.
> > >
> > > Overall, it feels too alien and linux-ish for my taste.
> > > Since we have sequence bound to some lock anyway, could we introduce
> > > some sort of generation-aware locks variants, which extend existing
> > > locks, and where lock/unlock bump generation number ?
> > Still, merging it to the guts of lock implementation is right
> > approach, IMO.
> >
>
> Current usage would be along with filedesc (sx) lock. The lock protects
> writes to entire fd table (and lock holders can block in malloc), while
> each file descriptor has its own counter. Also areas covered by seq are
> short and cannot block.
>
> As such, I don't really see any way to merge the lock with the counter.
Ok, I recall my proposal.

>
> I agree it would be useful, provided area protected by the lock would be
> the same as the one protected by the counter. If this code hits the tree
> and one day turns out someone needs such functionality, there should not
> be any problems (apart from time effort) in implementing this.
>
> --
> Mateusz Guzik

--HlXFiQcSFG/a+HqU Content-Type: application/pgp-signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQIcBAEBAgAGBQJT8bjGAAoJEJDCuSvBvK1BuUMP/0rf9cbKqSq8iHnGKIS2ORmZ Kmt2SMZSEqIEqR/RaVIwvvsCldgV7j2IYHIf74OFaQ/stPWSEJd8ftsDVhylCEIE XlMrW9W3BjsG224MMpsWXX30dm/iCfPBvKMl9ujJgEY7zpPCUgCIzu9QppJLJhxK Tk+zLu6fqT8ups7lsQkJLGS1ZhrWTGAQLmvFlGUsTI5lq0yQjKXzgeYLadP29ntx 7q2QbIX1AN7oV/KvM4GpjSmDuUnvpU5OntCcGFtvycX791A8KIhjBIKsZxqE3Snp Uw6ACdbOfT3i93AkFbM0kx8tSyzyozL6LTUaxPRG9A/H/7NNlivyUh+Ci5QFZtC3 i/BkRY9ty8cisq95EbJm23DtRNWxKq7GsXD/jOudv4BLIZA5T3HXEVNrjeEkFtZn 6EuD8PWh8WHkpxBqIgKy6ZxmxtDGc94ux+ECno5KOiV55hko9nLisgwdsPuAJA9U WVU989GSpkBrxtmrorDbz7LFmyUVWQ2aY2LBTT3Noy+fukNxLDgsXrjCqrh5YEZW AjrrmS865vIov0OE+3B7Y2qe140838dLbC00+sUAI7GHBd4/1DZL29BirMJoYAYk V1a/lNPxhtf8iQcQIGiLI4vbYd1OjXjESiCRAkezUArY/5kRD+3ORQiyto2Uilih fuEIwgfMENZVp8yi7rDt =9fvJ -----END PGP SIGNATURE----- --HlXFiQcSFG/a+HqU--

From owner-freebsd-arch@FreeBSD.ORG Mon Aug 18 13:11:51 2014 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 1BFDAFA4; Mon, 18 Aug 2014 13:11:51 +0000 (UTC) Received: from na01-by2-obe.outbound.protection.outlook.com
(mail-by2lp0243.outbound.protection.outlook.com [207.46.163.243]) (using TLSv1 with cipher ECDHE-RSA-AES256-SHA (256/256 bits)) (Client CN "mail.protection.outlook.com", Issuer "MSIT Machine Auth CA 2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id A9F7B3737; Mon, 18 Aug 2014 13:11:49 +0000 (UTC) Received: from BY2PR05CA030.namprd05.prod.outlook.com (10.141.250.20) by DM2PR05MB736.namprd05.prod.outlook.com (10.141.178.25) with Microsoft SMTP Server (TLS) id 15.0.1010.18; Mon, 18 Aug 2014 13:11:46 +0000 Received: from BY2FFO11FD058.protection.gbl (2a01:111:f400:7c0c::107) by BY2PR05CA030.outlook.office365.com (2a01:111:e400:2c5f::20) with Microsoft SMTP Server (TLS) id 15.0.1010.18 via Frontend Transport; Mon, 18 Aug 2014 13:11:46 +0000 Received: from P-EMF01-SAC.jnpr.net (66.129.239.15) by BY2FFO11FD058.mail.protection.outlook.com (10.1.15.178) with Microsoft SMTP Server (TLS) id 15.0.1010.11 via Frontend Transport; Mon, 18 Aug 2014 13:11:46 +0000 Received: from magenta.juniper.net (172.17.27.123) by P-EMF01-SAC.jnpr.net (172.24.192.21) with Microsoft SMTP Server (TLS) id 14.3.146.0; Mon, 18 Aug 2014 06:11:45 -0700 Received: from idle.juniper.net (idleski.juniper.net [172.25.4.26]) by magenta.juniper.net (8.11.3/8.11.3) with ESMTP id s7IDBbn83317; Mon, 18 Aug 2014 06:11:38 -0700 (PDT) (envelope-from phil@juniper.net) Received: from idle.juniper.net (localhost [127.0.0.1]) by idle.juniper.net (8.14.4/8.14.3) with ESMTP id s7IDBRtD018629; Mon, 18 Aug 2014 09:11:27 -0400 (EDT) (envelope-from phil@idle.juniper.net) Message-ID: <201408181311.s7IDBRtD018629@idle.juniper.net> To: Stefan Esser Subject: Re: XML Output: libxo - provide single API to output TXT, XML, JSON and HTML In-Reply-To: <53F1A311.4080707@freebsd.org> Date: Mon, 18 Aug 2014 09:11:27 -0400 From: Phil Shafer MIME-Version: 1.0 Content-Type: text/plain X-EOPAttributedMessage: 0 X-Forefront-Antispam-Report: CIP:66.129.239.15; CTRY:US; IPV:NLI; IPV:NLI; EFV:NLI; SFV:NSPM; 
SFS:(6009001)(199003)(164054003)(189002)(20776003)(64706001)(47776003)(79102001)(76482001)(15202345003)(87936001)(68736004)(76506005)(92566001)(69596002)(84676001)(46102001)(53416004)(77982001)(15975445006)(92726001)(80022001)(6806004)(21056001)(86362001)(102836001)(85306004)(44976005)(4396001)(97736001)(103666002)(83322001)(19580395003)(83072002)(50466002)(54356999)(50986999)(81342001)(81542001)(99396002)(107046002)(48376002)(106466001)(31966008)(74502001)(81156004)(105596002)(110136001)(95666004)(74662001); DIR:OUT; SFP:; SCL:1; SRVR:DM2PR05MB736; H:P-EMF01-SAC.jnpr.net; FPR:; MLV:sfv; PTR:InfoDomainNonexistent; A:1; MX:1; LANG:en; X-Microsoft-Antispam: BCL:0;PCL:0;RULEID:;UriScan:; X-Forefront-PRVS: 03077579FF Received-SPF: SoftFail (protection.outlook.com: domain of transitioning juniper.net discourages use of 66.129.239.15 as permitted sender) Authentication-Results: spf=softfail (sender IP is 66.129.239.15) smtp.mailfrom=phil@juniper.net; X-OriginatorOrg: juniper.net Cc: Marcel Moolenaar , John-Mark Gurney , Alfred Perlstein , "Simon J. Gerraty" , "arch@freebsd.org" , Poul-Henning Kamp , Konstantin Belousov , Marcel Moolenaar X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.18-1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 18 Aug 2014 13:11:51 -0000

Stefan Esser writes:
>Is it possible to introduce a "xo" command which takes a command
>line as an argument (in the same way as e.g. "time"). A sample
>usage could be "xo ls -s", which should invoke "ls -l" with its
>output converted to XML (and "xo -json ls -l" could produce JSON
>output).

I've implemented the "--libxo" option in a function called
xo_parse_args(), which is called before getopt* and processes
and removes libxo options.
See the example on:

    http://juniper.github.io/libxo/libxo-manual.html

FWIW, there's an "xo" command packaged with libxo that performs
similarly to the printf(1) command:

% xo --wrap top/data 'My {:pet} is {:age} years old\n' dog 2
My dog is 2 years old
% xo --xml --pretty --wrap top/data 'My {:pet} is {:age} years old\n' dog 2
<top>
  <data>
    <pet>dog</pet>
    <age>2</age>
  </data>
</top>

Thanks,
 Phil

From owner-freebsd-arch@FreeBSD.ORG Mon Aug 18 15:03:22 2014 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id E44C8D4E; Mon, 18 Aug 2014 15:03:22 +0000 (UTC) Received: from mail.ipfw.ru (mail.ipfw.ru [IPv6:2a01:4f8:120:6141::2]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id A923831E8; Mon, 18 Aug 2014 15:03:22 +0000 (UTC) Received: from [2a02:6b8:0:401:222:4dff:fe50:cd2f] (helo=ptichko.yndx.net) by mail.ipfw.ru with esmtpsa (TLSv1:DHE-RSA-AES128-SHA:128) (Exim 4.82 (FreeBSD)) (envelope-from ) id 1XJKVJ-0009pe-AL; Mon, 18 Aug 2014 14:49:33 +0400 Message-ID: <53F215A9.8010708@FreeBSD.org> Date: Mon, 18 Aug 2014 19:03:05 +0400 From: "Alexander V. Chernikov" User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:24.0) Gecko/20100101 Thunderbird/24.6.0 MIME-Version: 1.0 To: arch@freebsd.org Subject: superpages for UMA Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: "Andrey V. Elsukov" , Gleb Smirnoff X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.18-1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 18 Aug 2014 15:03:23 -0000

Hello list.

Currently UMA(9) uses PAGE_SIZE kegs to store items in.
It seems fine for most usage scenarios; however, there are some where
a very large number of items is required.

I've run into this problem while using ipfw tables (radix based) with
~50k records. This is what
`pmcstat -TS DTLB_LOAD_MISSES.MISS_CAUSES_A_WALK -w1` looks like:
PMC: [DTLB_LOAD_MISSES.MISS_CAUSES_A_WALK] Samples: 2359 (100.0%) , 0 unresolved

%SAMP IMAGE      FUNCTION             CALLERS
 28.7 kernel     rn_match             ipfw_lookup_table:21.7 rtalloc_fib_nolock:7.0
 25.5 ipfw.ko    ipfw_chk             ipfw_check_hook
  6.0 kernel     rn_lookup            ipfw_lookup_table

Some numbers: a table entry occupies 128 bytes, so we may store no more
than 30 records in a single page-sized keg.
50k records require more than 1500 kegs.
As far as I understand, the second-level TLB of a modern Intel CPU may
have 256 or 512 entries (for 4K pages), so using a large number of
entries results in TLB misses happening constantly.

Other examples:
Route tables (in the current implementation): struct rte occupies more
than 128 bytes, and storing a full view (> 500k routes) would result in
TLB misses happening all of the time.
Various stateful packet processing: a modern SLB/firewall can have
millions of states. Regardless of state size, PAGE_SIZE'd kegs are not
the best choice.

All of these can be addressed:
Ipfw tables/ipfw dynamic state allocation code can (and will) be
rewritten to use uma+uma_zone_set_allocf (suggested by glebius), and
radix should simply be changed to a different lookup algo (as is
happening in ipfw tables).

However, we may consider adding another UMA flag to allocate
2M/1G-sized kegs per request.
(Additionally, the Intel Haswell arch has 512 entries in the STLB
shared? between 4k/2M, so it should help the former.)

What do you think?
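The keg arithmetic in the "Some numbers" paragraph can be checked with a small C sketch. This is only an illustration under stated assumptions: the helper names `slabs_needed` and `fits_in_tlb` are hypothetical and not part of UMA, the figure of 30 items per 4K page and the 256/512-entry second-level TLB sizes are the ones quoted in this message, and the model assumes one 4K translation per page-sized keg with no other TLB users.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of UMA): number of page-sized slabs
 * needed to hold nitems when each slab stores at most items_per_slab.
 */
size_t
slabs_needed(size_t nitems, size_t items_per_slab)
{

	assert(items_per_slab > 0);
	/* Round up: a partially filled slab still occupies a whole page. */
	return ((nitems + items_per_slab - 1) / items_per_slab);
}

/*
 * Returns 1 if every slab could have its 4K translation cached at the
 * same time in a TLB with tlb_entries entries, 0 otherwise.  Assumes one
 * TLB entry per slab and ignores every other consumer of the TLB.
 */
int
fits_in_tlb(size_t nslabs, size_t tlb_entries)
{

	return (nslabs <= tlb_entries);
}
```

With the numbers from this message, slabs_needed(50000, 30) evaluates to 1667, which is more than three times a 512-entry second-level TLB, so fits_in_tlb(1667, 512) is 0.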
From owner-freebsd-arch@FreeBSD.ORG Mon Aug 18 17:36:51 2014 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 8E3D4F56; Mon, 18 Aug 2014 17:36:51 +0000 (UTC) Received: from dmz-mailsec-scanner-3.mit.edu (dmz-mailsec-scanner-3.mit.edu [18.9.25.14]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 893F73FF9; Mon, 18 Aug 2014 17:36:50 +0000 (UTC) X-AuditID: 1209190e-f79946d000007db1-65-53f239ab47e0 Received: from mailhub-auth-2.mit.edu ( [18.7.62.36]) (using TLS with cipher AES256-SHA (256/256 bits)) (Client did not present a certificate) by dmz-mailsec-scanner-3.mit.edu (Symantec Messaging Gateway) with SMTP id 9C.D4.32177.BA932F35; Mon, 18 Aug 2014 13:36:43 -0400 (EDT) Received: from outgoing.mit.edu (outgoing-auth-1.mit.edu [18.9.28.11]) by mailhub-auth-2.mit.edu (8.13.8/8.9.2) with ESMTP id s7IHagTV027334; Mon, 18 Aug 2014 13:36:42 -0400 Received: from multics.mit.edu (system-low-sipb.mit.edu [18.187.2.37]) (authenticated bits=56) (User authenticated as kaduk@ATHENA.MIT.EDU) by outgoing.mit.edu (8.13.8/8.12.4) with ESMTP id s7IHaeq1028756 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NOT); Mon, 18 Aug 2014 13:36:41 -0400 Received: (from kaduk@localhost) by multics.mit.edu (8.12.9.20060308) id s7IHaeOC014040; Mon, 18 Aug 2014 13:36:40 -0400 (EDT) Date: Mon, 18 Aug 2014 13:36:39 -0400 (EDT) From: Benjamin Kaduk To: Stefan Esser Subject: Re: XML Output: libxo - provide single API to output TXT, XML, JSON and HTML In-Reply-To: <53F1A311.4080707@freebsd.org> Message-ID: References: <201408151613.s7FGDMmt003567@idle.juniper.net> <53F1A311.4080707@freebsd.org> User-Agent: Alpine 1.10 (GSO 962 2008-03-14) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII 
X-Brightmail-Tracker: H4sIAAAAAAAAA+NgFvrPIsWRmVeSWpSXmKPExsUixG6norva8lOwwdKd4hZLZsxjtlhyZj27 xYw7T1gcmD1mfJrP4nG96Sp7AFMUl01Kak5mWWqRvl0CV8bvJV+ZC5pZKxYs+cfSwPiPuYuR k0NCwETi5/QTLBC2mMSFe+vZuhi5OIQEZjNJNPYsZYFwNjJKnFlzhwnCOcQkcfFnF1RZA6PE 9N8b2UD6WQS0JZ6du8cKYrMJqEjMfAMRFxFQlFgw6SATiM0s4Clx4eljRhBbWCBc4uasqWD1 nEC9R5dvZwexeQUcJc6vXAdmCwlESjxs+A7WKyqgI7F6/xQWiBpBiZMzn7BAzNSSWD59G8sE RsFZSFKzkKQWMDKtYpRNya3SzU3MzClOTdYtTk7My0st0jXWy80s0UtNKd3ECApbTkm+HYxf DyodYhTgYFTi4T358WOwEGtiWXFl7iFGSQ4mJVFeZYNPwUJ8SfkplRmJxRnxRaU5qcWHGCU4 mJVEeBNMgXK8KYmVValF+TApaQ4WJXHet9ZWwUIC6YklqdmpqQWpRTBZGQ4OJQneYAugRsGi 1PTUirTMnBKENBMHJ8hwHqDhN8CGFxck5hZnpkPkTzHqcrQ0ve1lEmLJy89LlRLnPWQOVCQA UpRRmgc3B5ZuXjGKA70lzFsJso4HmKrgJr0CWsIEtGTr4o8gS0oSEVJSDYyLy4P2bZ+gHrD4 167nC9aIaX1xfHJ3l/m0Ba92Jk0yWBI14ekO4ZA7FZO8eDsnTpecHp02u8HwZltg5g2Gx3LM l3+5TJUSWBP+/URZx8a1b37Ps2czz7Iw+M/ic0Os7H36uaAdtnVimQI97yYY+Wrsk7128MeF pk599yYPo+5I4ex7LP9CdP2VWIozEg21mIuKEwGAmSADEgMAAA== Cc: "arch@freebsd.org" , Phil Shafer X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.18-1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 18 Aug 2014 17:36:51 -0000 On Mon, 18 Aug 2014, Stefan Esser wrote: > Methods discussed so far are e.g.: > > - add long option as ARGV[1] (e.g. "--libxo-is-supported") > > - use command name prefix ("xo-$CMD" linked to the actual $CMD) > > - test for and use different standard file descriptors (XO_STDIN, > XO_STDOUT, and XO_STDERR) if supported by the command > > (I have probably forgotten a few ...) It seems prudent to consider how well such mechanisms would play with other libraries attempting to perform similar tricks with regard to detecting functionality. E.g., the "xo-" prefix can really only be used by one library at a time. 
-Ben From owner-freebsd-arch@FreeBSD.ORG Mon Aug 18 18:39:31 2014 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 37D0DDAD; Mon, 18 Aug 2014 18:39:31 +0000 (UTC) Received: from kib.kiev.ua (kib.kiev.ua [IPv6:2001:470:d5e7:1::1]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id B0AA03698; Mon, 18 Aug 2014 18:39:30 +0000 (UTC) Received: from tom.home (kib@localhost [127.0.0.1]) by kib.kiev.ua (8.14.9/8.14.9) with ESMTP id s7IIdPeD099532 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Mon, 18 Aug 2014 21:39:25 +0300 (EEST) (envelope-from kostikbel@gmail.com) DKIM-Filter: OpenDKIM Filter v2.9.2 kib.kiev.ua s7IIdPeD099532 Received: (from kostik@localhost) by tom.home (8.14.9/8.14.9/Submit) id s7IIdP4g099531; Mon, 18 Aug 2014 21:39:25 +0300 (EEST) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: tom.home: kostik set sender to kostikbel@gmail.com using -f Date: Mon, 18 Aug 2014 21:39:25 +0300 From: Konstantin Belousov To: "Alexander V. Chernikov" Subject: Re: superpages for UMA Message-ID: <20140818183925.GP2737@kib.kiev.ua> References: <53F215A9.8010708@FreeBSD.org> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="y8hmAOsilT9lKboI" Content-Disposition: inline In-Reply-To: <53F215A9.8010708@FreeBSD.org> User-Agent: Mutt/1.5.23 (2014-03-12) X-Spam-Status: No, score=-2.0 required=5.0 tests=ALL_TRUSTED,BAYES_00, DKIM_ADSP_CUSTOM_MED,FREEMAIL_FROM,NML_ADSP_CUSTOM_MED autolearn=no autolearn_force=no version=3.4.0 X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on tom.home Cc: arch@freebsd.org, Gleb Smirnoff , "Andrey V. 
Elsukov" X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.18-1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 18 Aug 2014 18:39:31 -0000
--y8hmAOsilT9lKboI Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable

On Mon, Aug 18, 2014 at 07:03:05PM +0400, Alexander V. Chernikov wrote:
> Hello list.
>
> Currently UMA(9) uses PAGE_SIZE kegs to store items in.
> It seems fine for most usage scenarios, however there are some where
> very large number of items is required.
>
> I've run into this problem while using ipfw tables (radix based) with
> ~50k records. This is how
> `pmcstat -TS DTLB_LOAD_MISSES.MISS_CAUSES_A_WALK -w1` looks like:
> PMC: [DTLB_LOAD_MISSES.MISS_CAUSES_A_WALK] Samples: 2359 (100.0%) , 0
> unresolved
>
> %SAMP IMAGE      FUNCTION             CALLERS
>  28.7 kernel     rn_match             ipfw_lookup_table:21.7
> rtalloc_fib_nolock:7.0
>  25.5 ipfw.ko    ipfw_chk             ipfw_check_hook
>   6.0 kernel     rn_lookup            ipfw_lookup_table
>
> Some numbers: table entry occupies 128 bytes, so we may store no more
> than 30 records in single page-sized keg.
> 50k records require more than 1500 kegs.
> As far as I understand second-level TLB for modern Intel CPU may be 256
> or 512 entries( for 4K pages ), so using large number of entries
> results in TLB cache misses constantly happening.
>
> Other examples:
> Route tables (in current implementation): struct rte occupies more than
> 128 bytes and storing full-view (> 500k routes) would result in TLB
> misses happening all of the time.
> Various stateful packet processing: modern SLB/firewall can have
> millions of states. Regardless of state size PAGE_SIZE'd kegs is not the
> best choice.
>
> All of these can be addressed:
> Ipfw tables/ipfw dynamic state allocation code can (and will) be
> rewritten to use uma+uma_zone_set_allocf (suggested by glebius),
> radix should simply be changed to a different lookup algo (as it is
> happening in ipfw tables).
>
> However, we may consider on adding another UMA flag to allocate
> 2M/1G-sized kegs per request.
> (Additionally, Intel Haswell arch has 512 entries in STLB shared?
> between 4k/2M so it should help the former).
>
> What do you think?
>

Zones with small object sizes use uma_small_alloc() to request a physical
page and its KVA mapping. On amd64, uma_small_alloc() allocates a
physical page and returns the direct mapping address for the page. The
direct map is done by large pages (2MB, 1GB if available). In this
sense, your allocations already use large pages for virtual memory
translations.

Zones are not local in the KVA, i.e. objects from the same zone are
usually far apart in the KVA. Zones do not get dedicated submaps to
contain the zone-owned pages.

Note that the large-page TLB is usually relatively small. E.g. on my
Nehalem machine, it only has 32 entries which can hold 2MB pages,
which results in 64MB of cached address space translations in
the best case. You might try to reduce the available memory to
see the increased locality and better DTLB hit ratio, if your load
can survive with less memory.
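The best-case figure above is simply the number of TLB entries times the page size each entry maps. A minimal C sketch of that arithmetic follows; the helper name `tlb_reach_bytes` is hypothetical, the 32-entry and 2MB values are the Nehalem numbers quoted above, and the real coverage is lower whenever the cached translations are not the ones the workload touches.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: upper bound on the address space whose
 * translations a TLB can hold at once, given its entry count and the
 * page size mapped by each entry.
 */
uint64_t
tlb_reach_bytes(uint64_t entries, uint64_t page_size)
{

	assert(page_size > 0);
	return (entries * page_size);
}
```

For 32 entries of 2MB pages, tlb_reach_bytes(32, 2ULL << 20) gives 64MB, matching the figure above; the same 32 entries holding 4K pages would cover only 128KB.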
--y8hmAOsilT9lKboI Content-Type: application/pgp-signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQIcBAEBAgAGBQJT8khdAAoJEJDCuSvBvK1BhjQP/R565J1uLGZorgaLL9g8Vmkb 2+NsiNyxtRqEUkOQu5mvtuJrRFfHhshQlnyu1mya5710Y4JndIGsUKiiSSot/zSe 81833zvmOWE0MKJ7vVLH7Iw/PgOM+7obWm7QxuiLgLrOW/HJOdwZWABm0dw1zdIU eu249sF4F4OhRzxBilV5jCb2m8iIRc90St07eBz+441p3xR+ZgVpBQAlQiODAV+j 4CpxpxQrvBWqhdCOKISnKMiOi2rIx4NUz5SdVXF3EjfvV40WWkMuwSnTc4jNMO7p qY53ChGfcKsfx2CKwpzfrSPZ8wStk5s1hmryoCHEIffzyKRrnQ5Yy+ksOT+fFoe3 OW5GSbDKE+3pgEsPqwuuLhLciX1rZ9LWFoCesciVWqh9er5n3CT5XjllN3wFRGyb s79uUsBBc4Yk+mowgyzwtGZTzIZTLtXkkVochHwDCRB5IhvWFWWyJ0heVN/mwaI3 3KlmN5JMsv+XXGO0WV/h8qVdIzlvXzbmZqXeuLoX7YbRvpjyckxsAG1UJGqTDNPx nsCZwLZqpb7oJ0xXvdkbj1Gl3P35sa4YVNaPiY2T9JwdyWMQ88hz2U+D7xr4zw1E HFFFka76CUWIKoInOW54vQOZhAayq24Sy7hUJeq01Zd+GCFHfo1Kahs0mG0jPtPU ZBlEZoHQzvXyj49i/fiq =K3wE -----END PGP SIGNATURE----- --y8hmAOsilT9lKboI-- From owner-freebsd-arch@FreeBSD.ORG Mon Aug 18 19:45:27 2014 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 4966461B; Mon, 18 Aug 2014 19:45:27 +0000 (UTC) Received: from mail-qc0-x230.google.com (mail-qc0-x230.google.com [IPv6:2607:f8b0:400d:c01::230]) (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id CB8B73E3C; Mon, 18 Aug 2014 19:45:26 +0000 (UTC) Received: by mail-qc0-f176.google.com with SMTP id m20so5351593qcx.35 for ; Mon, 18 Aug 2014 12:45:25 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:in-reply-to:references:date:message-id:subject :from:to:cc:content-type; bh=JaebXXZUm9b4PmO3YqhfcQsFOSMkpoJPDSei3E/JaWo=; 
b=LV6sTVs2D9YIIjyXUmBIR1d/M3yrVo8Av+W1eRZVylLvNZrIWcndMvz5Cv9qr3+K+8 bMprjgLqrcPLmGInynn/HO3qWd3uH8RLJAg8CM/DovlSup1kSJQ2Ac98ISmuY6ML+qR4 BJW6pY+0zrQ3dANRZsJ6mWSC2MZnYMocApM0wgzNslgwwbUg42PG+mD5ELZeh/BQvcKE Yt1dk5ZxxLvneTI5zDqHPoL2SubrZD2wAujtcHUkKt5L8R/cAkh2hhbJdjaoSGm3w5+B AI7HSzS3+iV3DxXdMT9/vjjWc72ozHaPNcnsZjADgSP6RCqR1Y1eNBNxq62MCyctrRl2 7W+Q== MIME-Version: 1.0 X-Received: by 10.140.27.144 with SMTP id 16mr55305116qgx.18.1408391125698; Mon, 18 Aug 2014 12:45:25 -0700 (PDT) Sender: adrian.chadd@gmail.com Received: by 10.224.39.139 with HTTP; Mon, 18 Aug 2014 12:45:25 -0700 (PDT) In-Reply-To: <20140818183925.GP2737@kib.kiev.ua> References: <53F215A9.8010708@FreeBSD.org> <20140818183925.GP2737@kib.kiev.ua> Date: Mon, 18 Aug 2014 12:45:25 -0700 X-Google-Sender-Auth: mTqPAID1-WA3Y8GAwUvn97dr36g Message-ID: Subject: Re: superpages for UMA From: Adrian Chadd To: Konstantin Belousov Content-Type: text/plain; charset=UTF-8 Cc: "freebsd-arch@freebsd.org" , Gleb Smirnoff , "Alexander V. Chernikov" , "Andrey V. Elsukov" X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.18-1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 18 Aug 2014 19:45:27 -0000 Hi! I dug into this a little bit last year. I saw a lot of time spent just walking TLBs for VM pages when doing a lot of VM page -> network pushing. On the sandy bridge boxes with 1G page entries, the TLB only has 4 entries. The high area of memory isn't 1G aligned, so we don't use 1G pages for all the stuff that's allocated initially. That includes, among other things, all the VM memory that you need. The other thing that crept up was that we don't try to reserve memory in any way - we'll just fragment stuff quickly from the pmap and allocate where we can when we can. So there's currently no attempt to allocate small kernel structures from the same underlying 1G page. 
That'd be an interesting experiment - allocating VM entries and other small things like rtentry and mbuf UMA entries from one or two 1GB regions of memory. It may make better use of the 1G (or 2M) TLB entries and keep things hot. -a From owner-freebsd-arch@FreeBSD.ORG Mon Aug 18 19:48:47 2014 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 934CB763; Mon, 18 Aug 2014 19:48:47 +0000 (UTC) Received: from mail-ie0-x230.google.com (mail-ie0-x230.google.com [IPv6:2607:f8b0:4001:c03::230]) (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 43BD83E65; Mon, 18 Aug 2014 19:48:47 +0000 (UTC) Received: by mail-ie0-f176.google.com with SMTP id tr6so17594ieb.21 for ; Mon, 18 Aug 2014 12:48:46 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:reply-to:in-reply-to:references:date:message-id :subject:from:to:cc:content-type; bh=84IbjVCLMhvLluqidC+1Sq2vMOVqQN5PnM7dgeoS+S4=; b=ljddjmezHjNN2ZmvERtC6ydVvSyEQZ3uTqoVwVHnY+BV5OlDxhg4I8EzhmRbJ7/NTV mHK7QHoWO+v4g+KbIm+J9BXIUqSgzZYN2qqOjN2hjdFtTH8OZltrtgoVxGDgbcHnlN7V 1hKUuHfqQ0QMJri/iVbYGdp3IJgVdr2AjpOm1BcycbfcDyjZQN6YzuaZftSrTlfv+7hO TjWn6w6NAcDuqJQ2/eH6b0EtUPI5bbrZKmtcWJa8R7K1vvt+nbcWE8Kszxg3lXa8BrSh 9efkENTkZJUDVTBgJfol9zwhoN16lhT8qYPS40f2ObE6qER87MyHlADFIiD6mTiZkP+N bonQ== MIME-Version: 1.0 X-Received: by 10.42.171.138 with SMTP id j10mr3073695icz.75.1408391326660; Mon, 18 Aug 2014 12:48:46 -0700 (PDT) Received: by 10.43.17.196 with HTTP; Mon, 18 Aug 2014 12:48:46 -0700 (PDT) Reply-To: alc@freebsd.org In-Reply-To: <20140818183925.GP2737@kib.kiev.ua> References: <53F215A9.8010708@FreeBSD.org> <20140818183925.GP2737@kib.kiev.ua> Date: Mon, 18 
Aug 2014 14:48:46 -0500 Message-ID: Subject: Re: superpages for UMA From: Alan Cox To: Konstantin Belousov Content-Type: text/plain; charset=UTF-8 X-Content-Filtered-By: Mailman/MimeDel 2.1.18-1 Cc: arch@freebsd.org, Gleb Smirnoff , "Alexander V. Chernikov" , "Andrey V. Elsukov" X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.18-1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 18 Aug 2014 19:48:47 -0000 On Mon, Aug 18, 2014 at 1:39 PM, Konstantin Belousov wrote: > On Mon, Aug 18, 2014 at 07:03:05PM +0400, Alexander V. Chernikov wrote: > > Hello list. > > > > Currently UMA(9) uses PAGE_SIZE kegs to store items in. > > It seems fine for most usage scenarios, however there are some where > > very large number of items is required. > > > > I've run into this problem while using ipfw tables (radix based) with > > ~50k records. This is how > > `pmcstat -TS DTLB_LOAD_MISSES.MISS_CAUSES_A_WALK -w1` looks like: > > PMC: [DTLB_LOAD_MISSES.MISS_CAUSES_A_WALK] Samples: 2359 (100.0%) , 0 > > unresolved > > > > %SAMP IMAGE FUNCTION CALLERS > > 28.7 kernel rn_match ipfw_lookup_table:21.7 > > rtalloc_fib_nolock:7.0 > > 25.5 ipfw.ko ipfw_chk ipfw_check_hook > > 6.0 kernel rn_lookup ipfw_lookup_table > > > > Some numbers: table entry occupies 128 bytes, so we may store no more > > than 30 records in single page-sized keg. > > 50k records require more than 1500 kegs. > > As far as I understand second-level TLB for modern Intel CPU may be 256 > > or 512 entries( for 4K pages ), so using large number of entries > > results in TLB cache misses constantly happening. > > > > Other examples: > > Route tables (in current implementation): struct rte occupies more than > > 128 bytes and storing full-view (> 500k routes) would result in TLB > > misses happening all of the time. 
> > Various stateful packet processing: modern SLB/firewall can have
> > millions of states. Regardless of state size PAGE_SIZE'd kegs is not the
> > best choice.
> >
> > All of these can be addressed:
> > Ipfw tables/ipfw dynamic state allocation code can (and will) be
> > rewritten to use uma+uma_zone_set_allocf (suggested by glebius),
> > radix should simply be changed to a different lookup algo (as it is
> > happening in ipfw tables).
> >
> > However, we may consider on adding another UMA flag to allocate
> > 2M/1G-sized kegs per request.
> > (Additionally, Intel Haswell arch has 512 entries in STLB shared?
> > between 4k/2M so it should help the former).
> >
> > What do you think?
> >
> Zones with small object sizes use uma_small_alloc() to request physical
> page and its KVA mapping. On amd64, uma_small_alloc() allocates a
> physical page and returns direct mapping address for the page. The
> direct map is done by large pages (2MB, 1GB if available). In this
> sense, your allocations already use large pages for virtual memory
> translations.
>
> Zones are not local in the KVA, i.e. objects from the same zone are
> usually far apart in the KVA. Zones do not get dedicated submaps to
> contain the zone-owned pages.
>
> Note that large pages TLB is usually relatively small. E.g. on my
> Nehalem machine, it only has 32 entries which can hold 2MB pages,
> which results in the 64MB of cached address space translations in
> the best case. You might try to reduce the available memory to
> see the increased locality and better DTLB hit ratio, if your load
> can survive with lesser memory size.
>

Newer Intel CPUs have more entries, and AMD CPUs have long (since
Barcelona) had more. In particular, they allow 2 MB page mappings to be
cached in a larger L2 TLB. Nowadays, the trouble is with the 1 GB pages.
A lot of CPUs still only support an 8-entry, single-level TLB for 1 GB pages.
It might make sense to increase the largest size used by the buddy allocator in vm_phys.c to 1 GB. Then, the VM_FREEPOOL_DIRECT mechanism might help. Back in the days when Opteron TLBs had only 8 2MB entries, I wrote the following in the commit message for r170477: "The twist is that this allocator tries to reduce the number of TLB misses incurred by accesses through a direct map to small, UMA-managed objects and page table pages. Roughly speaking, the physical pages that are allocated for such purposes are clustered together in the physical address space. The performance benefits vary. In the most extreme case, a uniprocessor kernel running on an Opteron, I measured an 18% reduction in system time during a buildworld. From owner-freebsd-arch@FreeBSD.ORG Mon Aug 18 19:52:13 2014 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 170E5A9A; Mon, 18 Aug 2014 19:52:13 +0000 (UTC) Received: from mail-ie0-x22d.google.com (mail-ie0-x22d.google.com [IPv6:2607:f8b0:4001:c03::22d]) (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id B02993F13; Mon, 18 Aug 2014 19:52:12 +0000 (UTC) Received: by mail-ie0-f173.google.com with SMTP id tr6so21213ieb.32 for ; Mon, 18 Aug 2014 12:52:12 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:reply-to:in-reply-to:references:date:message-id :subject:from:to:cc:content-type; bh=Qe9GhZoEMzzWfGPQC8PbPUvioN19l1P/k59IRCPkbQ8=; b=oDRBQqNL1uly0yKgisXzKunZWnAlbWynVWY0IxFP+h5fbXKgThmkLGC8EBoTeQvG4h Q8ZaGYkWWUg+4wvtP5EwQkEXY/JZSruoCTCOiseYqldqGtp9vMAVyER9qeBS6PvvRLAi qpm5bpXzRtuzLzLhh66wQl43fyUhrRBAvSgvBKRFMxY5gW4GU59Hz57v2k9LwxMm+RR4 
zyILc/p91cxCATheQDxSsjFg9Y8DF0X2zrId0Mwl+qAVlFk9hY4EUp6UvN3Jw6GYqVRf DHynrairRSpAcKKa6iX91worxGstCyfWQo68jexA+Einhxl/O0ZixxpVcdpBljXBsomP WNGQ== MIME-Version: 1.0 X-Received: by 10.43.70.66 with SMTP id yf2mr19814257icb.36.1408391532138; Mon, 18 Aug 2014 12:52:12 -0700 (PDT) Received: by 10.43.17.196 with HTTP; Mon, 18 Aug 2014 12:52:12 -0700 (PDT) Reply-To: alc@freebsd.org In-Reply-To: References: <53F215A9.8010708@FreeBSD.org> <20140818183925.GP2737@kib.kiev.ua> Date: Mon, 18 Aug 2014 14:52:12 -0500 Message-ID: Subject: Re: superpages for UMA From: Alan Cox To: Adrian Chadd Content-Type: text/plain; charset=UTF-8 X-Content-Filtered-By: Mailman/MimeDel 2.1.18-1 Cc: Konstantin Belousov , "freebsd-arch@freebsd.org" , Gleb Smirnoff , "Alexander V. Chernikov" , "Andrey V. Elsukov" X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.18-1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 18 Aug 2014 19:52:13 -0000 On Mon, Aug 18, 2014 at 2:45 PM, Adrian Chadd wrote: > Hi! > > I dug into this a little bit last year. I saw a lot of time spent just > walking TLBs for VM pages when doing a lot of VM page -> network > pushing. > > On the sandy bridge boxes with 1G page entries, the TLB only has 4 entries. > > The high area of memory isn't 1G aligned, so we don't use 1G pages for > all the stuff that's allocated initially. That includes, among other > things, all the VM memory that you need. > > The other thing that crept up was that we don't try to reserve memory > in any way - we'll just fragment stuff quickly from the pmap and > allocate where we can when we can. So there's currently no attempt to > allocate small kernel structures from the same underlying 1G page. > > For uma_small_alloc(), there is VM_FREEPOOL_DIRECT. However, this is still tuned for 2 MB pages. 
> That'd be an interesting experiment - allocating VM entries and other > small things like rtentry and mbuf UMA entries from one or two 1GB > regions of memory. It may make better use of the 1G (or 2M) TLB > entries and keep things hot. > > > > -a > _______________________________________________ > freebsd-arch@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-arch > To unsubscribe, send any mail to "freebsd-arch-unsubscribe@freebsd.org" > From owner-freebsd-arch@FreeBSD.ORG Mon Aug 18 20:13:33 2014 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 2258A1AA; Mon, 18 Aug 2014 20:13:33 +0000 (UTC) Received: from alto.onthenet.com.au (alto.OntheNet.com.au [203.13.68.12]) by mx1.freebsd.org (Postfix) with ESMTP id D6385313F; Mon, 18 Aug 2014 20:13:32 +0000 (UTC) Received: from dommail.onthenet.com.au (dommail.OntheNet.com.au [203.13.70.57]) by alto.onthenet.com.au (Postfix) with ESMTPS id 875C11245D; Tue, 19 Aug 2014 06:13:24 +1000 (EST) Received: from Peter-Grehans-MacBook-Pro-2.local ([64.245.0.210]) by dommail.onthenet.com.au (MOS 4.4.4-GA) with ESMTP id BXU20151 (AUTH peterg@ptree32.com.au); Tue, 19 Aug 2014 06:13:23 +1000 Message-ID: <53F25E60.5050109@freebsd.org> Date: Mon, 18 Aug 2014 13:13:20 -0700 From: Peter Grehan User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:24.0) Gecko/20100101 Thunderbird/24.6.0 MIME-Version: 1.0 To: alc@freebsd.org, Konstantin Belousov Subject: Re: superpages for UMA References: <53F215A9.8010708@FreeBSD.org> <20140818183925.GP2737@kib.kiev.ua> In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: arch@freebsd.org, Gleb Smirnoff , "Alexander V. Chernikov" , "Andrey V. 
Elsukov" X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.18-1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 18 Aug 2014 20:13:33 -0000 > Newer Intel CPUs have more entries, and AMD CPUs have long (since > Barcelona) had more. In particular, they allow 2 MB page mappings to be > cached in a larger L2 TLB. Nowadays, the trouble is with the 1 GB pages. > A lot of CPUs still only support an 8 entry, 1 level TLB for 1 GB pages. There are new(ish) ones effectively without 1GB pages. From the "Software Optimization Guide for AMD Family 16h Processors" "Smashing" ... "when the Family 16h processor encounters a 1-Gbyte page size, it will smash translations of that 1-Gbyte region into 2-Mbyte TLB entries, each of which translates a 2-Mbyte region of the 1-Gbyte page." later, Peter. From owner-freebsd-arch@FreeBSD.ORG Mon Aug 18 20:26:31 2014 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 217179AF for ; Mon, 18 Aug 2014 20:26:31 +0000 (UTC) Received: from mail-pd0-f170.google.com (mail-pd0-f170.google.com [209.85.192.170]) (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id DEEA533B7 for ; Mon, 18 Aug 2014 20:26:30 +0000 (UTC) Received: by mail-pd0-f170.google.com with SMTP id g10so8306491pdj.1 for ; Mon, 18 Aug 2014 13:26:24 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:sender:content-type:mime-version:subject:from :in-reply-to:date:cc:message-id:references:to; bh=qUaQfSoyxwXZHjgbkO8snPN3xfU+oyxS+aaJGM92k80=; 
b=i6cfsNql7eahq12qhbfzM3sYY6jeBUUscyn3McEbGMYvUERc0Dc2fsRCHEsIRfbfMM 1FLtN2O3C1mDB1TiwzEH4dnL8vrkaZ4i8LEdfNSyRmkEot/P68v2iIL3Rp86jSRiT2g6 YlXPbRX5W2wAcvSubFq58zq3+hKDXpnCHcWd1dLPFTS+5OSZvOzB17H90gP8gWAMcbbw kdR7Wf9xAd07QCDJHwcZkxo9Ld+GT4qjfRp6x9H7y7exxiXWX5y5i7nki2R3C2fa8D9m 1fgWaUuv/LSMFbPht7r2twkUZntFLZ20WCjTKxGTnFr+nRY3lT+g4hN4AVBZ3c5pR1oQ yEqA== X-Gm-Message-State: ALoCoQnTZBly1//yBroe/1oKFc8j1kDiL4TG1c+V6eDM+ItyiegfPO5c4M0+dXedoLD7gVrsbLrK X-Received: by 10.70.44.70 with SMTP id c6mr36094617pdm.75.1408393584700; Mon, 18 Aug 2014 13:26:24 -0700 (PDT) Received: from lgmac-jku.corp.netflix.com (dc1-prod.netflix.com. [69.53.236.251]) by mx.google.com with ESMTPSA id hk7sm26086705pdb.4.2014.08.18.13.26.23 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Mon, 18 Aug 2014 13:26:23 -0700 (PDT) Sender: Warner Losh Content-Type: multipart/signed; boundary="Apple-Mail=_4C658C27-60E7-4EE6-BC54-329709FB8759"; protocol="application/pgp-signature"; micalg=pgp-sha512 Mime-Version: 1.0 (Mac OS X Mail 7.3 \(1878.6\)) Subject: Re: superpages for UMA From: Warner Losh In-Reply-To: <53F25E60.5050109@freebsd.org> Date: Mon, 18 Aug 2014 14:26:21 -0600 Message-Id: <257A0976-7C5E-4029-AF32-BFB3A6C60832@bsdimp.com> References: <53F215A9.8010708@FreeBSD.org> <20140818183925.GP2737@kib.kiev.ua> <53F25E60.5050109@freebsd.org> To: Peter Grehan X-Mailer: Apple Mail (2.1878.6) Cc: arch@freebsd.org X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.18-1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 18 Aug 2014 20:26:31 -0000 --Apple-Mail=_4C658C27-60E7-4EE6-BC54-329709FB8759 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset=windows-1252 On Aug 18, 2014, at 2:13 PM, Peter Grehan wrote: >> Newer Intel CPUs have more entries, and AMD CPUs have long (since >> Barcelona) had more. 
In particular, they allow 2 MB page mappings to be >> cached in a larger L2 TLB. Nowadays, the trouble is with the 1 GB pages. >> A lot of CPUs still only support an 8 entry, 1 level TLB for 1 GB pages. > > There are new(ish) ones effectively without 1GB pages. From the "Software Optimization Guide for AMD Family 16h Processors" > > "Smashing" > ... > "when the Family 16h processor encounters a 1-Gbyte page size, it will smash translations of that 1-Gbyte region into 2-Mbyte TLB entries, each > of which translates a 2-Mbyte region of the 1-Gbyte page." “we’ll emulate this feature designed to make things go faster in hardware in software by doing the very thing that makes it go slow in hardware.” Fun times. Performance Smashing! Warner --Apple-Mail=_4C658C27-60E7-4EE6-BC54-329709FB8759 Content-Transfer-Encoding: 7bit Content-Disposition: attachment; filename=signature.asc Content-Type: application/pgp-signature; name=signature.asc Content-Description: Message signed with OpenPGP using GPGMail -----BEGIN PGP SIGNATURE----- Comment: GPGTools - https://gpgtools.org iQIcBAEBCgAGBQJT8mFtAAoJEGwc0Sh9sBEAILYQAJvi/5avR/rBR2VivBhiWVIG 3HjtyIPbTu2XE9OiyF+h4BkREZ9Wu1dyUgKnCKqYM4DPkTGdSAcRGCdSa8GqDYva xV0QU2JH2DpjXZgmlO5JKYVzDmn/7GJVd5Ix71jg5yneg8kKl4U14ZxXcboLAY36 8t020p6vzIKNkz352kXYqLR/aCle3opbzmXTtq3lMqZHc3UMptq+XIG8m91SlQWc 24CSuJOV1W1rvi0RJ2iFR3KYE9cxvA7iUTd8RsqV5aevc22DZsjBLYRuwaA5Z2uy xFVflbrv3bA2vxw1GdtJ/W3LiD1oH+GP0jTGHMMG/jmJTlL6JbnhHR3MT0l3Ue57 dsrI24GV0aarjjHx282cyn77RTsrR0N6Kn0mw1usRWYixY/k5JNqbdQoIXB2Fqyx Mt4Axj3jm9kIjRCJNVx5XCix7md2SU402ac8zXdreD42IvyyXfc6cgWXvd8WNXXK XdEyvRbQs50ktb5eXBpm9yqsRcOl6d0C0tyP7SaDCevmTn6+405Z6QytK3L9Pc+Y yWC5hFaBLw/26JFhjF2E7ysfnfH3Nn+jIS5CgmuPzzp+qXYfmRmf5HQyJ01fr0lh b+tSS4sJV1WOC+tEt/2Joiw3llJYiSO07x4hT/GatVZtk1e4RlRER/AX0suVyF4F Ry22w2qx+U8yfO094Ef1 =h2+C -----END PGP SIGNATURE----- --Apple-Mail=_4C658C27-60E7-4EE6-BC54-329709FB8759-- From owner-freebsd-arch@FreeBSD.ORG Mon Aug 18 20:39:23 2014 
Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 3AF41CB1; Mon, 18 Aug 2014 20:39:23 +0000 (UTC) Received: from pp2.rice.edu (proofpoint2.mail.rice.edu [128.42.201.101]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id F250E34A7; Mon, 18 Aug 2014 20:39:22 +0000 (UTC) Received: from pps.filterd (pp2.rice.edu [127.0.0.1]) by pp2.rice.edu (8.14.5/8.14.5) with SMTP id s7IKacaH028827; Mon, 18 Aug 2014 15:39:21 -0500 Received: from mh1.mail.rice.edu (mh1.mail.rice.edu [128.42.201.20]) by pp2.rice.edu with ESMTP id 1numser5te-1; Mon, 18 Aug 2014 15:39:21 -0500 X-Virus-Scanned: by amavis-2.7.0 at mh1.mail.rice.edu, auth channel Received: from 108-254-203-201.lightspeed.hstntx.sbcglobal.net (108-254-203-201.lightspeed.hstntx.sbcglobal.net [108.254.203.201]) (using TLSv1 with cipher RC4-MD5 (128/128 bits)) (No client certificate requested) (Authenticated sender: alc) by mh1.mail.rice.edu (Postfix) with ESMTPSA id AA3134601D5; Mon, 18 Aug 2014 15:39:20 -0500 (CDT) Message-ID: <53F26477.8050004@rice.edu> Date: Mon, 18 Aug 2014 15:39:20 -0500 From: Alan Cox User-Agent: Mozilla/5.0 (X11; FreeBSD i386; rv:24.0) Gecko/20100101 Thunderbird/24.2.0 MIME-Version: 1.0 To: Peter Grehan , alc@freebsd.org, Konstantin Belousov Subject: Re: superpages for UMA References: <53F215A9.8010708@FreeBSD.org> <20140818183925.GP2737@kib.kiev.ua> <53F25E60.5050109@freebsd.org> In-Reply-To: <53F25E60.5050109@freebsd.org> X-Enigmail-Version: 1.6 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 kscore.is_bulkscore=0 kscore.compositescore=0 circleOfTrustscore=0 compositescore=0.629899992726084 urlsuspect_oldscore=0.0298999927260837 
suspectscore=3 recipient_domain_to_sender_totalscore=0 phishscore=0 bulkscore=0 kscore.is_spamscore=3.8904595378586e-08 recipient_to_sender_totalscore=0 recipient_domain_to_sender_domain_totalscore=498 rbsscore=0.629899992726084 spamscore=0 recipient_to_sender_domain_totalscore=0 urlsuspectscore=0.9 adultscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=7.0.1-1402240000 definitions=main-1408180228 Cc: arch@freebsd.org, Gleb Smirnoff , "Alexander V. Chernikov" , "Andrey V. Elsukov" X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.18-1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 18 Aug 2014 20:39:23 -0000 On 08/18/2014 15:13, Peter Grehan wrote: >> Newer Intel CPUs have more entries, and AMD CPUs have long (since >> Barcelona) had more. In particular, they allow 2 MB page mappings to be >> cached in a larger L2 TLB. Nowadays, the trouble is with the 1 GB >> pages. >> A lot of CPUs still only support an 8 entry, 1 level TLB for 1 GB pages. > > There are new(ish) ones effectively without 1GB pages. From the > "Software Optimization Guide for AMD Family 16h Processors" > My recollection is that the first Intel processors to support 1 GB page mappings did this. They allowed you to set PG_PS on the 1GB PTE, but there were no actual 1 GB page TLB entries. Also, after I modified the direct map on amd64 to use 1 GB pages, I noticed some strange performance anomalies. Specifically, sometimes performance was worse than I expected. It turned out that when the end of DRAM wasn't aligned to a 1 GB boundary, and the end of DRAM was mapped with a 1 GB PTE, the TLB would wind up with 4 KB mappings for anything covered by that last PTE. Whereas, before, it was at least 2 MB aligned and we would wind up with 2 MB page mappings in the TLB. So, now, the direct map creation code is aware of this issue. > "Smashing" > ... 
> "when the Family 16h processor encounters a 1-Gbyte page size, it will > smash translations of that 1-Gbyte region into 2-Mbyte TLB entries, each > of which translates a 2-Mbyte region of the 1-Gbyte page." > > later, > > Peter. > From owner-freebsd-arch@FreeBSD.ORG Mon Aug 18 20:44:43 2014 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id A0679E7D; Mon, 18 Aug 2014 20:44:43 +0000 (UTC) Received: from mail-wg0-x230.google.com (mail-wg0-x230.google.com [IPv6:2a00:1450:400c:c00::230]) (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id E2C783560; Mon, 18 Aug 2014 20:44:42 +0000 (UTC) Received: by mail-wg0-f48.google.com with SMTP id x13so5488000wgg.31 for ; Mon, 18 Aug 2014 13:44:41 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type; bh=RKbRj9D+jRvOkG2T5WFd2NHXTFG96lDpLM/CBZchTD8=; b=Mkw/6+2y3eTJPzcxuoLX4fGff85Kmv9UHSoCsEVwePZu/K7nbBskS7Fj3f588pEE2p x62ZkmAFpawgDjv/a9oi7Llm1aI7GcmHyovzSP/Sv4LqqGMaqpkxdlP0+ifwUGB5/IMk zjEqFCnWwexe8Foo/aZEly+QViurNI7/o2fZujePly5RsUzV4N1+vmPMmdhMe6+2saO0 cwEv+DV+0KzAa8gq0gGcE6TY86mzv9RAoqvR8KxPQYfSfxqSE328ydx3c2kGpVlSEV/I EuUFX53q4HwtuYHyUgG3dwchJZEXU3bS/w1Q4m3QKey04bEvkQW4WgfLALa8l1H3F8gQ 0ytw== MIME-Version: 1.0 X-Received: by 10.180.102.130 with SMTP id fo2mr1450859wib.29.1408394680972; Mon, 18 Aug 2014 13:44:40 -0700 (PDT) Received: by 10.216.160.9 with HTTP; Mon, 18 Aug 2014 13:44:40 -0700 (PDT) In-Reply-To: <20140711232914.GH41807@pwnie.vrt.sourcefire.com> References: <20140711232914.GH41807@pwnie.vrt.sourcefire.com> Date: Mon, 18 Aug 2014 16:44:40 -0400 Message-ID: Subject: Re: 
[RFC] ASLR Whitepaper and Candidate Final Patch From: Shawn Webb To: freebsd-arch@freebsd.org Content-Type: text/plain; charset=UTF-8 X-Content-Filtered-By: Mailman/MimeDel 2.1.18-1 Cc: PaX Team , Bryan Drewery , Alan Cox , Dag-Erling Smørgrav , Oliver Pinter X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.18-1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 18 Aug 2014 20:44:43 -0000 I've uploaded a new patch to Phabricator: https://reviews.freebsd.org/D473. I'm interested in hearing feedback from the community. On Fri, Jul 11, 2014 at 7:29 PM, Shawn Webb wrote: > Hey All, > > Oliver Pinter and I have been working hard on our ASLR implementation. > We're now in the final stages of development and would like to get > feedback from the community. I've attached to this email a small > whitepaper that details our implementation and the accompanying patch. > > There is one part of the patch that I wrote that is quite an ugly hack > and would like to get some feedback on. I added a little hack to > sys_mmap() to apply ASLR to calls to mmap(2) when MAP_32BIT is > specified. I'd like to replace that ugly hack with something a bit more > beautiful, so if anyone has any suggestions, I'm all ears. > > Other than that ugly hack, the code adheres to FreeBSD's style(9) > standards. I believe we have an awesome implementation, one I've > personally been using without issue for months. > > I'm looking forward to your comments and questions. I've CC'd the PaX > team. Please keep them CC'd in your replies. > > Thank you very much, > > Shawn Webb > CC: PaX Team > CC: Oliver Pinter > CC: des@freebsd.org > CC: alc@rice.edu > CC: bdrewery@freebsd.org > > PS - Sorry for the duplicate emails. I hit the wrong key and didn't CC > everyone. 
> From owner-freebsd-arch@FreeBSD.ORG Mon Aug 18 22:35:53 2014 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id C64FAFD1; Mon, 18 Aug 2014 22:35:53 +0000 (UTC) Received: from mail-ie0-x22e.google.com (mail-ie0-x22e.google.com [IPv6:2607:f8b0:4001:c03::22e]) (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 88F883F43; Mon, 18 Aug 2014 22:35:53 +0000 (UTC) Received: by mail-ie0-f174.google.com with SMTP id rp18so176754iec.5 for ; Mon, 18 Aug 2014 15:35:52 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:reply-to:in-reply-to:references:date:message-id :subject:from:to:cc:content-type; bh=f5LlijFXX7pDpvpWeF6NZZMfcBpCCrtSa9MFRU69x3Y=; b=DFW1mSqsSRI1uFdfrYZDbmQBi+Mx9mf0220Vhq7yvK7Z85n2g6+WDSslneqcjwq35x nk+/DTSIO+RD3w38eu3uSZhsuyRHGYLQpl8unnB6+YSPzfWuiJPETJsGQk7F3DqryLP7 JwAm9jDQXaKUl7RmPVEXSuYKAFV+Yzr+Q2jc+PDsw/vAu6RbJmr7eEKuxFHU7nPZkz/4 2bT1S3rLrRg0ROEiYqo+5ot0zzgAj3WdbePnldg+6Wrdm3J8i20XCDoEGYmWFCcwaqoc W57jSrAVxyOPG55MvQq+SSCZoVGUhPMl+0BEEclylHD2rJTq+tbl37/j38ISM3hkobZl oOhQ== MIME-Version: 1.0 X-Received: by 10.43.127.136 with SMTP id ha8mr3526994icc.78.1408401352808; Mon, 18 Aug 2014 15:35:52 -0700 (PDT) Received: by 10.43.17.196 with HTTP; Mon, 18 Aug 2014 15:35:52 -0700 (PDT) Reply-To: alc@freebsd.org In-Reply-To: <257A0976-7C5E-4029-AF32-BFB3A6C60832@bsdimp.com> References: <53F215A9.8010708@FreeBSD.org> <20140818183925.GP2737@kib.kiev.ua> <53F25E60.5050109@freebsd.org> <257A0976-7C5E-4029-AF32-BFB3A6C60832@bsdimp.com> Date: Mon, 18 Aug 2014 17:35:52 -0500 Message-ID: Subject: Re: superpages for UMA From: Alan Cox To: Warner Losh Content-Type: text/plain; charset=UTF-8 
Content-Transfer-Encoding: quoted-printable X-Content-Filtered-By: Mailman/MimeDel 2.1.18-1 Cc: "freebsd-arch@freebsd.org" , Peter Grehan X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.18-1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 18 Aug 2014 22:35:53 -0000 On Mon, Aug 18, 2014 at 3:26 PM, Warner Losh wrote: > > On Aug 18, 2014, at 2:13 PM, Peter Grehan wrote: > > >> Newer Intel CPUs have more entries, and AMD CPUs have long (since > >> Barcelona) had more. In particular, they allow 2 MB page mappings to be > >> cached in a larger L2 TLB. Nowadays, the trouble is with the 1 GB > pages. > >> A lot of CPUs still only support an 8 entry, 1 level TLB for 1 GB pages. > > > > There are new(ish) ones effectively without 1GB pages. From the > "Software Optimization Guide for AMD Family 16h Processors" > > > > "Smashing" > > ... > > "when the Family 16h processor encounters a 1-Gbyte page size, it will > smash translations of that 1-Gbyte region into 2-Mbyte TLB entries, each > > of which translates a 2-Mbyte region of the 1-Gbyte page." > > “we’ll emulate this feature designed to make things go faster in hardware > in software by doing the very thing that makes it go slow in hardware.” > > Fun times. Performance Smashing! > > I'm guessing that these are low-power processors, where they don't want to have another CAM consuming power. Under those circumstances, it's still better to support 1 GB page mappings in the page table even if the TLB doesn't support them than not to support 1 GB page mappings at all. With the hierarchical page tables on x86, you get a 512x reduction in page table size with each increase in page size. So, on a TLB miss, the page table walk is more likely to be all L2 data cache hits, rather than misses that go all the way to DRAM. 
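As a worked illustration of the 512x figure (a sketch, not part of the original message): on amd64 each page-table page holds 512 eight-byte entries, so mapping a 1 GB region needs 262144 leaf entries with 4 KB pages, 512 with 2 MB pages, and a single entry with a 1 GB page. The helper names and macros below are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define	KIB		UINT64_C(1024)
#define	MIB		(1024 * KIB)
#define	GIB		(1024 * MIB)
#define	PTE_SIZE	UINT64_C(8)	/* bytes per amd64 page-table entry */

/* Number of leaf entries needed to map `region` with pages of `page_size`. */
static uint64_t
leaf_entries(uint64_t region, uint64_t page_size)
{
	assert(page_size != 0 && region % page_size == 0);
	return (region / page_size);
}

/* Bytes of leaf-level page table that those entries occupy. */
static uint64_t
leaf_table_bytes(uint64_t region, uint64_t page_size)
{
	return (leaf_entries(region, page_size) * PTE_SIZE);
}
```

For a 1 GB region this works out to 2 MB of leaf tables with 4 KB pages, 4 KB with 2 MB pages, and 8 bytes with a 1 GB page, which is why a page table walk for large mappings is much more likely to stay within the data caches.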
One feature that I always liked about the AMD performance counters was that they allowed you to count L2 cache misses caused by page table walks on a TLB miss. This was often a better predictor of whether large pages were going to be beneficial than counting TLB misses. From owner-freebsd-arch@FreeBSD.ORG Tue Aug 19 19:24:17 2014 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 62593AD6 for ; Tue, 19 Aug 2014 19:24:17 +0000 (UTC) Received: from mail-ie0-x22b.google.com (mail-ie0-x22b.google.com [IPv6:2607:f8b0:4001:c03::22b]) (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 2B0F9366D for ; Tue, 19 Aug 2014 19:24:17 +0000 (UTC) Received: by mail-ie0-f171.google.com with SMTP id at1so1726811iec.30 for ; Tue, 19 Aug 2014 12:24:16 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:reply-to:in-reply-to:references:date:message-id :subject:from:to:cc:content-type; bh=AhsNDcdclo1l2TRZsvQfLnLnR5DxqhmfcXkd/jJWcPc=; b=BNJrdKhVfXWJuAXB7PLbYJQ1f2XHe/S1cUUOsRgv2j/hQGSjHyDMOnWA931HCpuhO/ 7Nrebe6UCFNRsrOBGkSOvqZ0JVpmP8ESUBDvGxjFTka+s4V49HDvvnhDY+RE7kNsqSrV OumDKpSiltmk5hHj5HAeKvIrkq1kVMmHNWAaiKapROlL/zyxeG1IqnlnxlO+i+h6XeZO YhuS0+wEmV+4ePOsTZK8O57yLWtkU4zac8MMosYM8ZKH4qPJkULmk3zyiz3Vv0GqBiqV 7WB5L29wvH44V4UNYkfCEV7BBD5Vhu0jBPd8j63bhuXMGY2NhIQDHWUuy++fvfsFy7FM rO9g== MIME-Version: 1.0 X-Received: by 10.43.164.130 with SMTP id ms2mr44412552icc.9.1408476256600; Tue, 19 Aug 2014 12:24:16 -0700 (PDT) Received: by 10.43.17.196 with HTTP; Tue, 19 Aug 2014 12:24:16 -0700 (PDT) Reply-To: alc@freebsd.org In-Reply-To: <20140817012646.GA21025@dft-labs.eu> References: 
<1408064112-573-1-git-send-email-mjguzik@gmail.com> <1408064112-573-2-git-send-email-mjguzik@gmail.com> <20140816093811.GX2737@kib.kiev.ua> <20140816185406.GD2737@kib.kiev.ua> <20140817012646.GA21025@dft-labs.eu> Date: Tue, 19 Aug 2014 14:24:16 -0500 Message-ID: Subject: Re: [PATCH 1/2] Implement simple sequence counters with memory barriers. From: Alan Cox To: Mateusz Guzik Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable X-Content-Filtered-By: Mailman/MimeDel 2.1.18-1 Cc: Konstantin Belousov , Johan Schuijt , "freebsd-arch@freebsd.org" X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.18-1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 19 Aug 2014 19:24:17 -0000 On Sat, Aug 16, 2014 at 8:26 PM, Mateusz Guzik wrote: > On Sat, Aug 16, 2014 at 09:54:06PM +0300, Konstantin Belousov wrote: > > On Sat, Aug 16, 2014 at 12:38:11PM +0300, Konstantin Belousov wrote: > > > On Fri, Aug 15, 2014 at 02:55:11AM +0200, Mateusz Guzik wrote: > > > > --- > > > > sys/sys/seq.h | 126 > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ > > > > 1 file changed, 126 insertions(+) > > > > create mode 100644 sys/sys/seq.h > > > > > > > > diff --git a/sys/sys/seq.h b/sys/sys/seq.h > > > > new file mode 100644 > > > > index 0000000..0971aef > > > > --- /dev/null > > > > +++ b/sys/sys/seq.h > [..] > > > > +#ifndef _SYS_SEQ_H_ > > > > +#define _SYS_SEQ_H_ > > > > + > > > > +#ifdef _KERNEL > > > > + > > > > +/* > > > > + * Typical usage: > > > > + * > > > > + * writers: > > > > + * lock_exclusive(&obj->lock); > > > > + * seq_write_begin(&obj->seq); > > > > + * ..... 
> > > > + * seq_write_end(&obj->seq); > > > > + * unlock_exclusive(&obj->unlock); > > > > + * > > > > + * readers: > > > > + * obj_t lobj; > > > > + * seq_t seq; > > > > + * > > > > + * for (;;) { > > > > + * seq = seq_read(&gobj->seq); > > > > + * lobj = gobj; > > > > + * if (seq_consistent(&gobj->seq, seq)) > > > > + * break; > > > > + * cpu_spinwait(); > > > > + * } > > > > + * foo(lobj); > > > > + */ > > > > + > > > > +typedef uint32_t seq_t; > > > > + > > > > +/* A hack to get MPASS macro */ > > > > +#include > > > > +#include > > > > + > > > > +#include > > > > + > > > > +static __inline bool > > > > +seq_in_modify(seq_t seqp) > > > > +{ > > > > + > > > > + return (seqp & 1); > > > > +} > > > > + > > > > +static __inline void > > > > +seq_write_begin(seq_t *seqp) > > > > +{ > > > > + > > > > + MPASS(!seq_in_modify(*seqp)); > > > > + (*seqp)++; > > > > + wmb(); > > > This probably ought to be written as atomic_add_rel_int(seqp, 1); > > Alan Cox rightfully pointed out that better expression is > > v = *seqp + 1; > > atomic_store_rel_int(seqp, v); > > which also takes care of TSO on x86. > > > > Well, my memory-barrier-and-so-on-fu is rather weak. > > I had another look at the issue. At least on amd64, it looks like only > compiler barrier is required for both reads and writes. > > According to AMD64 Architecture Programmer’s Manual Volume 2: System > Programming, 7.2 Multiprocessor Memory Access Ordering states: > > "Loads do not pass previous loads (loads are not reordered). Stores do > not pass previous stores (stores are not reordered)" > > Since the code modifying stuff only performs a series of writes and we > expect exclusive writers, I find it applicable to this scenario. > > I checked linux sources and generated assembly, they indeed issue only > a compiler barrier on amd64 (and for intel processors as well). 
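For readers following the barrier discussion, here is a minimal sketch of the "typical usage" pattern from the quoted patch, written with portable C11 <stdatomic.h> instead of the kernel's atomic(9) or {r,w}mb() primitives. It is an illustration under stated assumptions, not the code under review: struct obj, obj_write() and obj_try_read() are hypothetical names, the payload is two ints, and the writer is assumed to already hold the exclusive lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t seq_t;

/* Hypothetical object: counter is even when stable, odd during a write. */
struct obj {
	_Atomic seq_t	seq;
	_Atomic int	a;	/* payload is atomic so racy reads are defined */
	_Atomic int	b;
};

static bool
seq_in_modify(seq_t seq)
{
	return ((seq & 1) != 0);
}

/* Writer side; the caller is assumed to hold the exclusive lock. */
static void
obj_write(struct obj *o, int a, int b)
{
	seq_t seq = atomic_load_explicit(&o->seq, memory_order_relaxed);

	assert(!seq_in_modify(seq));
	atomic_store_explicit(&o->seq, seq + 1, memory_order_relaxed);
	/* Order the odd counter store before the payload stores. */
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&o->a, a, memory_order_relaxed);
	atomic_store_explicit(&o->b, b, memory_order_relaxed);
	/* Release store: payload is visible before the even counter. */
	atomic_store_explicit(&o->seq, seq + 2, memory_order_release);
}

/* One lockless read attempt; returns false if a writer interfered. */
static bool
obj_try_read(struct obj *o, int *a, int *b)
{
	seq_t seq = atomic_load_explicit(&o->seq, memory_order_acquire);

	if (seq_in_modify(seq))
		return (false);
	*a = atomic_load_explicit(&o->a, memory_order_relaxed);
	*b = atomic_load_explicit(&o->b, memory_order_relaxed);
	/* Order the payload loads before re-reading the counter. */
	atomic_thread_fence(memory_order_acquire);
	return (atomic_load_explicit(&o->seq, memory_order_relaxed) == seq);
}
```

A reader would call obj_try_read() in a loop and retry (with a spin-wait hint) until it returns true. On x86 the acquire and release operations used here should compile down to plain loads and stores plus compiler ordering, which matches the observation that only a compiler barrier is needed on amd64.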
> > atomic_store_rel_int on amd64 seems fine in this regard, but the only > function for loads issues lock cmpxchg which kills performance > (median 55693659 -> 12789232 ops in a microbenchmark) for no gain. > > Additionally release and acquire semantics seem to be a stronger guarantee > than needed. > > This statement left me puzzled and got me to look at our x86 atomic.h for the first time in years. It appears that our implementation of atomic_load_acq_int() on x86 is, umm ..., unconventional. That is, it is enforcing a constraint that simple acquire loads don't normally enforce. For example, the C11 stdatomic.h simple acquire load doesn't enforce this constraint. Moreover, our own implementation of atomic_load_acq_int() on ia64, where the mapping from atomic_load_acq_int() to machine instructions is straightforward, doesn't enforce this constraint either. Give us a chance to sort this out before you do anything further. As Kostik said, but in different words, we've always written our machine-independent layer code using acquires and releases to express the required ordering constraints and not {r,w}mb() primitives. > As far as sequence counters go, we should be able to get away with > ensuring the following: > - all relevant reads are performed between given points > - all relevant writes are performed between given points > > As such, I propose introducing another set of atomic_* function variants > (or stealing smp_{w,r,}mb idea from linux) which provide just that. > > So for amd64 reading guarantee and writing guarantee could be provided > in the same way with a compiler barrier. > > > > Same note for all other linux-style barriers. In fact, on x86 > > > wmb() is sfence and it serves no useful purpose in seq_write*. > > > > > > Overall, it feels too alien and linux-ish for my taste. 
> > > Since we have sequence bound to some lock anyway, could we introduce > > > some sort of generation-aware locks variants, which extend existing > > > locks, and where lock/unlock bump generation number ? > > Still, merging it to the guts of lock implementation is right > > approach, IMO. > > > > Current usage would be along with filedesc (sx) lock. The lock protects > writes to entire fd table (and lock holders can block in malloc), while > each file descriptor has its own counter. Also areas covered by seq are > short and cannot block. > > As such, I don't really see any way to merge the lock with the counter. > > I agree it would be useful, provided area protected by the lock would be > the same as the one protected by the counter. If this code hits the tree > and one day turns out someone needs such functionality, there should not > be any problems (apart from time effort) in implementing this. > > -- > Mateusz Guzik > _______________________________________________ > freebsd-arch@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-arch > To unsubscribe, send any mail to "freebsd-arch-unsubscribe@freebsd.org" > From owner-freebsd-arch@FreeBSD.ORG Wed Aug 20 14:14:17 2014 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 99F90CF2 for ; Wed, 20 Aug 2014 14:14:17 +0000 (UTC) Received: from mail-la0-x22e.google.com (mail-la0-x22e.google.com [IPv6:2a00:1450:4010:c03::22e]) (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 23A3A33AA for ; Wed, 20 Aug 2014 14:14:16 +0000 (UTC) Received: by mail-la0-f46.google.com with SMTP id b8so7430222lan.33 for ; Wed, 20 Aug 2014 07:14:15 -0700 (PDT) DKIM-Signature: v=1; 
a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=sender:date:from:to:subject:message-id:mail-followup-to :mime-version:content-type:content-disposition:user-agent; bh=679Ranpv3mj3AMEB7vEKIdLsDV9vUA3k9ZcEWr9DivI=; b=tQbaED1OOuLfgdnYpHquW9W59Ok40jw7DlwhK4c/W+QsBsjk58e+4yu/diO16f7dlt WAK5yypywKplmyU/xTHi9atZ5skpAF/OTcRBLYJQaALWovzTlAPFhUFBf0TggWn3jT1Q VTlee9WUiOpJfR3Oii7VoUtKnbMcp9MozjYqZffbsC1q2CRBk3XmD85xmH5xiCD4WQk1 IUwa2E+6lKp3g556W5LgjSKDyaUgxBuApxevDCVhsnL3HtjqkYmoZqCKKP4aWuwvBt5a 7toMCMI+tyJ2M2tnis2yEjhdvulLZ2zNhLNsq214vKpJV5iTva9ktknceLdvjMAbG8A9 620A== X-Received: by 10.152.36.195 with SMTP id s3mr42458925laj.28.1408544055014; Wed, 20 Aug 2014 07:14:15 -0700 (PDT) Received: from pc5.home (adbj194.neoplus.adsl.tpnet.pl. [79.184.9.194]) by mx.google.com with ESMTPSA id a1sm14515456lak.45.2014.08.20.07.14.13 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Wed, 20 Aug 2014 07:14:14 -0700 (PDT) Sender: Edward Tomasz Napierała Date: Wed, 20 Aug 2014 16:14:11 +0200 From: Edward Tomasz Napierała To: arch@FreeBSD.org Subject: Autofs startup scripts. Message-ID: <20140820141411.GB12179@pc5.home> Mail-Followup-To: arch@FreeBSD.org MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.5.21 (2010-09-15) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.18-1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 20 Aug 2014 14:14:17 -0000 As it is now, autofs uses three separate rc.d scripts: automount, automountd, and autounmountd. They execute one utility and two daemons. They are all controlled by a single rc var: autofs_enable. The question is: is this the right way to do it? Would it be better to have only one script instead? If I went this route, how should configuring command line options for each of the three executables work? 
From owner-freebsd-arch@FreeBSD.ORG Wed Aug 20 16:00:40 2014 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 2693371A; Wed, 20 Aug 2014 16:00:40 +0000 (UTC) Received: from bigwig.baldwin.cx (bigwig.baldwin.cx [IPv6:2001:470:1f11:75::1]) (using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id F0661307E; Wed, 20 Aug 2014 16:00:39 +0000 (UTC) Received: from jhbbsd.localnet (unknown [209.249.190.124]) by bigwig.baldwin.cx (Postfix) with ESMTPSA id D78A1B9C4; Wed, 20 Aug 2014 12:00:38 -0400 (EDT) From: John Baldwin To: Benjamin Kaduk Subject: Re: current fd allocation idiom Date: Wed, 20 Aug 2014 11:10:10 -0400 User-Agent: KMail/1.13.5 (FreeBSD/8.4-CBSD-20140415; KDE/4.5.5; amd64; ; ) References: <20140717235538.GA15714@dft-labs.eu> <20140813015627.GC17869@dft-labs.eu> In-Reply-To: MIME-Version: 1.0 Content-Type: Text/Plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Message-Id: <201408201110.10431.jhb@freebsd.org> X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.2.7 (bigwig.baldwin.cx); Wed, 20 Aug 2014 12:00:38 -0400 (EDT) Cc: Konstantin Belousov , Mateusz Guzik , freebsd-arch@freebsd.org X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.18-1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 20 Aug 2014 16:00:40 -0000 On Friday, August 15, 2014 7:20:03 pm Benjamin Kaduk wrote: > On Tue, 12 Aug 2014, Mateusz Guzik wrote: > > > On Tue, Aug 12, 2014 at 09:31:15PM -0400, Benjamin Kaduk wrote: > > > On Tue, Aug 12, 2014 at 7:36 PM, Mateusz Guzik wrote: > > > > > > > I would expect soabort to result in a timeout/reset as 
opposed to regular > > > > connection close. > > > > > > > > Comments around soabort suggest it should not be used as a replacement > > > > for close, but maybe this is largely because of what the other end will > > > > see. That will need to be investigated. > > > > > > > > > > > I added some text regarding soabort to socket.9 in r266962 -- does that > > > help clarify the situation? > > > > > > > Nope. :-) > > > > It is unclear if the only motivation here is making sure nobody else > > sees the socket when given thread calls soabort. This would be easily > > guaranteed in here: fd allocation failed, fp with given socket was never > > exposed to anyone. > > > > So, if you say soabort would work here just fine, I'm happy to use it > > and blame you for problems. :-) > > Hmm, I was hoping that jhb would chime in and save me from being on the > hook, but it does look like soabort() would be acceptable in this case. I think having the EMFILE/ENFILE case use the same exact logic as a listen queue overflow (i.e. soabort()) is correct. 
-- John Baldwin From owner-freebsd-arch@FreeBSD.ORG Wed Aug 20 16:00:41 2014 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 6B09A71C; Wed, 20 Aug 2014 16:00:41 +0000 (UTC) Received: from bigwig.baldwin.cx (bigwig.baldwin.cx [IPv6:2001:470:1f11:75::1]) (using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 40B66307F; Wed, 20 Aug 2014 16:00:41 +0000 (UTC) Received: from jhbbsd.localnet (unknown [209.249.190.124]) by bigwig.baldwin.cx (Postfix) with ESMTPSA id 28E96B9CA; Wed, 20 Aug 2014 12:00:40 -0400 (EDT) From: John Baldwin To: Bruce Evans Subject: Re: [PATCH 0/2] plug capability races Date: Wed, 20 Aug 2014 11:11:47 -0400 User-Agent: KMail/1.13.5 (FreeBSD/8.4-CBSD-20140415; KDE/4.5.5; amd64; ; ) References: <1408064112-573-1-git-send-email-mjguzik@gmail.com> <201408151031.45967.jhb@freebsd.org> <20140816102840.V1007@besplex.bde.org> In-Reply-To: <20140816102840.V1007@besplex.bde.org> MIME-Version: 1.0 Content-Type: Text/Plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Message-Id: <201408201111.47601.jhb@freebsd.org> X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.2.7 (bigwig.baldwin.cx); Wed, 20 Aug 2014 12:00:40 -0400 (EDT) Cc: Robert Watson , Mateusz Guzik , Konstantin Belousov , Johan Schuijt , freebsd-arch@freebsd.org X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.18-1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 20 Aug 2014 16:00:41 -0000 On Friday, August 15, 2014 9:34:59 pm Bruce Evans wrote: > On Fri, 15 Aug 2014, John Baldwin wrote: > > > One thing I would like to see is for the timecounter 
code to be adapted to use > > the seq API instead of doing it by hand (the timecounter code is also missing > > barriers due to doing it by hand). > > Locking in the timecounter code is poor (1), but I fear a general mechanism > would be slower. Also, the timecounter code now extends into userland, > so purely kernel locking cannot work for it. The userland part is > more careful about locking than the kernel. It has memory barriers and > other pessimizations which were intentionally left out of the kernel > locking for timecounters. If these barriers are actually necessary, then > they give the silly situation that there are fewer races for userland > timecounting than kernel timecounting provided userland mostly does > direct accesses instead of syscalls and kernel uses of timecounters are > infrequent enough to not race often with the userland accesses. Yes, the userland code is more correct here. The barriers are indeed missing in the kernel part, and adding them should give something equivalent to a correctly working seq API as it is doing the same thing. 
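For illustration, the kind of barrier placement being discussed can be sketched in portable C11. This is a minimal userland sketch, not the proposed sys/sys/seq.h and not the timecounter code: the names struct sample, sample_write() and sample_read() are made up for the example, and writers are assumed to be serialized by some external lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical structure; names are not taken from the FreeBSD tree. */
struct sample {
	atomic_uint seq;	/* even: stable, odd: write in progress */
	_Atomic uint64_t sec;
	_Atomic uint64_t frac;
};

/* Writer side; callers are assumed to hold an exclusive lock. */
static void
sample_write(struct sample *s, uint64_t sec, uint64_t frac)
{
	unsigned seq = atomic_load_explicit(&s->seq, memory_order_relaxed);

	assert((seq & 1) == 0);	/* no other write may be in progress */
	atomic_store_explicit(&s->seq, seq + 1, memory_order_relaxed);
	/* Order the "odd" marker before the data stores. */
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&s->sec, sec, memory_order_relaxed);
	atomic_store_explicit(&s->frac, frac, memory_order_relaxed);
	/* Publish: data stores become visible before the new even count. */
	atomic_store_explicit(&s->seq, seq + 2, memory_order_release);
}

/* Lockless reader; retries until it sees an unchanged, even count. */
static void
sample_read(struct sample *s, uint64_t *sec, uint64_t *frac)
{
	unsigned before, after;

	do {
		before = atomic_load_explicit(&s->seq, memory_order_acquire);
		*sec = atomic_load_explicit(&s->sec, memory_order_relaxed);
		*frac = atomic_load_explicit(&s->frac, memory_order_relaxed);
		/* Order the data loads before the re-check of the count. */
		atomic_thread_fence(memory_order_acquire);
		after = atomic_load_explicit(&s->seq, memory_order_relaxed);
	} while ((before & 1) != 0 || before != after);
}
```

The reader pays for one acquire load and one acquire fence per attempt; on x86 both typically compile to plain loads plus compiler barriers, which is roughly the cost being argued about for the kernel timecounter path.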
-- John Baldwin From owner-freebsd-arch@FreeBSD.ORG Wed Aug 20 16:15:57 2014 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id BCA6DDC7 for ; Wed, 20 Aug 2014 16:15:57 +0000 (UTC) Received: from mail-la0-x22c.google.com (mail-la0-x22c.google.com [IPv6:2a00:1450:4010:c03::22c]) (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 476983349 for ; Wed, 20 Aug 2014 16:15:57 +0000 (UTC) Received: by mail-la0-f44.google.com with SMTP id el20so7641445lab.3 for ; Wed, 20 Aug 2014 09:15:55 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:in-reply-to:references:date:message-id:subject :from:to:content-type:content-transfer-encoding; bh=9tz6pRzSioXhyPeMN2Pix2qrWd+bOt74pN40GoGgel0=; b=wFmyyo8T9YVEYxQwSVeyYkIykLZKqMu/U3GwuDWNyqQ3hh6pzgY3zFUP7q9S7nnxM2 U7QTzDV+OeusnjxQVTjuxTXn9CAbf96IMXsqda196oLDhQkFhJWF2b8hMyp2XMgin21z 0H1Mha6PKF0sW1jZ6klYCPYILyk2hZbrTrMlW5SN+DGHaBRTYpQhIhZKSwlHrd2kH45y hofqdNSnygQbl6GNMia9mUxoWo1DNSejTFjPKRpuI3aEyu1FOvLyPb3NytVDBA3CBB0v 3GhXJhkKr58lzvfyJZxPmPUEEoq1UcABcGxq8lsYe5tH3iCdphgULezJHp9GufVSQElZ 7C/A== MIME-Version: 1.0 X-Received: by 10.152.22.165 with SMTP id e5mr22478131laf.57.1408551355016; Wed, 20 Aug 2014 09:15:55 -0700 (PDT) Sender: crodr001@gmail.com Received: by 10.112.197.107 with HTTP; Wed, 20 Aug 2014 09:15:54 -0700 (PDT) In-Reply-To: <20140820141411.GB12179@pc5.home> References: <20140820141411.GB12179@pc5.home> Date: Wed, 20 Aug 2014 09:15:54 -0700 X-Google-Sender-Auth: juSB_INJEsQz6N3GjeO_CN0oqzE Message-ID: Subject: Re: Autofs startup scripts. 
From: Craig Rodrigues To: arch@freebsd.org Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.18-1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 20 Aug 2014 16:15:57 -0000 On Wed, Aug 20, 2014 at 7:14 AM, Edward Tomasz Napierała wrote: > As it is now, autofs uses three separate rc.d scripts: automount, > automountd, and autounmountd. They execute one utility and two daemons. > They are all controlled by a single rc var: autofs_enable. Question > is: is this the right way to do it? Would it be better to have only > one script instead? If I went this route, how should configuring > command line options for each of the three executables work? You could probably combine everything into one autofs script, since those three scripts are very closely related. You could have separate automount_args, automountd_args, autounmountd_args variables for each binary. There is a freebsd-rc@ mailing list where you can ask for help on this stuff, but it is a low traffic list. 
-- Craig From owner-freebsd-arch@FreeBSD.ORG Wed Aug 20 19:31:15 2014 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 162CFC31; Wed, 20 Aug 2014 19:31:15 +0000 (UTC) Received: from mail.dawidek.net (garage.dawidek.net [91.121.88.72]) by mx1.freebsd.org (Postfix) with ESMTP id D27723CC5; Wed, 20 Aug 2014 19:31:13 +0000 (UTC) Received: from localhost (89-77-9-208.dynamic.chello.pl [89.77.9.208]) by mail.dawidek.net (Postfix) with ESMTPSA id 9645915A; Wed, 20 Aug 2014 21:23:10 +0200 (CEST) Date: Wed, 20 Aug 2014 21:24:19 +0200 From: Pawel Jakub Dawidek To: Mateusz Guzik Subject: Re: [PATCH 0/2] plug capability races Message-ID: <20140820192419.GA1834@garage.freebsd.pl> References: <1408064112-573-1-git-send-email-mjguzik@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1408064112-573-1-git-send-email-mjguzik@gmail.com> X-OS: FreeBSD 11.0-CURRENT amd64 User-Agent: Mutt/1.5.22 (2013-10-16) Cc: Konstantin Belousov , Robert Watson , Johan Schuijt , freebsd-arch@freebsd.org X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.18-1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 20 Aug 2014 19:31:15 -0000 The patch looks good to me. Thanks for working on the fix, Mateusz! The only minor nit I found is that fde_change, fde_change_size and fde_seq should use capital letters as those are macros. On Fri, Aug 15, 2014 at 02:55:10AM +0200, Mateusz Guzik wrote: > fget_unlocked currently reads 'fde' which is a structure consisting of > several fields. In effect the read is non-atomic and may result in > obtaining file pointer with stale or incorrect capabilities. 
> > Example race is with dup2. > > Side effect is that capability checks can be circumvented. > > Proposed way to fix it is with the help of sequence counters. > > Patchset assumes stuff from > 'Getting rid of atomic_load_acq_int(&fdp->fd_nfiles)) from fget_unlocked' > ( http://lists.freebsd.org/pipermail/freebsd-arch/2014-July/015550.html ) > is applied. There is no technical dependency between patches (apart from > READ_ONCE), but this patch amortizes performance hit introduced with seqlock. > > So this introduces a measurable hit with a microbenchmark (16 threads > reading from a pipe which fails with EAGAIN), but is still much faster than > current code with atomic_load_acq_int(&fdp->fd_nfiles). > > x propernoacq-readpipe-run-sum > + seq2-noacq-readpipe-run-sum > N Min Max Median Avg Stddev > x 20 59479718 59527286 59496714 59499504 13752.968 > + 20 54520752 54920054 54829539 54773480 136842.96 > Difference at 95.0% confidence > -4.72602e+06 +/- 62244.4 > -7.94296% +/- 0.104613% > (Student's t, pooled s = 97250) > > There is still one theoretical race unfixed, but I don't believe it matters > much. > > The race is: > fp gets reallocated before refcount check. This results in returning fp > regardless of new caps, but I don't see how this particular race could be > exploited. It could be fixed by re-reading entire fde and checking if it > changed. > > -- > 2.0.2 > -- Pawel Jakub Dawidek http://www.wheelsystems.com FreeBSD committer http://www.FreeBSD.org Am I Evil? Yes, I Am! 
http://mobter.com From owner-freebsd-arch@FreeBSD.ORG Wed Aug 20 20:23:06 2014 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id DE8A5475; Wed, 20 Aug 2014 20:23:05 +0000 (UTC) Received: from mail107.syd.optusnet.com.au (mail107.syd.optusnet.com.au [211.29.132.53]) by mx1.freebsd.org (Postfix) with ESMTP id 893D432A9; Wed, 20 Aug 2014 20:23:05 +0000 (UTC) Received: from c122-106-147-133.carlnfd1.nsw.optusnet.com.au (c122-106-147-133.carlnfd1.nsw.optusnet.com.au [122.106.147.133]) by mail107.syd.optusnet.com.au (Postfix) with ESMTPS id 7BD19D448FA; Thu, 21 Aug 2014 06:22:56 +1000 (EST) Date: Thu, 21 Aug 2014 06:22:55 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: John Baldwin Subject: Re: [PATCH 0/2] plug capability races In-Reply-To: <201408201111.47601.jhb@freebsd.org> Message-ID: <20140821044234.H11472@besplex.bde.org> References: <1408064112-573-1-git-send-email-mjguzik@gmail.com> <201408151031.45967.jhb@freebsd.org> <20140816102840.V1007@besplex.bde.org> <201408201111.47601.jhb@freebsd.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.1 cv=BdjhjNd2 c=1 sm=1 tr=0 a=7NqvjVvQucbO2RlWB8PEog==:117 a=PO7r1zJSAAAA:8 a=tTSYktBZc9AA:10 a=KN91Z2BipYgA:10 a=kj9zAlcOel0A:10 a=JzwRw_2MAAAA:8 a=YwIKbEHEEUB-9GaOcawA:9 a=kzYn1Pzwvs4spdd-:21 a=Ip1ZeEM7m2elqLRx:21 a=CjuIK1q_8ugA:10 Cc: Mateusz Guzik , Robert Watson , Johan Schuijt , freebsd-arch@freebsd.org, Konstantin Belousov X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.18-1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 20 Aug 2014 20:23:06 -0000 On Wed, 20 Aug 2014, John Baldwin 
wrote: > On Friday, August 15, 2014 9:34:59 pm Bruce Evans wrote: >> On Fri, 15 Aug 2014, John Baldwin wrote: >> >>> One thing I would like to see is for the timecounter code to be adapted to use >>> the seq API instead of doing it by hand (the timecounter code is also missing >>> barriers due to doing it by hand). >> >> Locking in the timecounter code is poor (1), but I fear a general mechanism >> would be slower. Also, the timecounter code now extends into userland, >> so purely kernel locking cannot work for it. The userland part is >> more careful about locking than the kernel. It has memory barriers and >> other pessimizations which were intentionally left out of the kernel >> locking for timecounters. If these barriers are actually necessary, then >> they give the silly situation that there are fewer races for userland >> timecounting than kernel timecounting provided userland mostly does >> direct accesses instead of syscalls and kernel uses of timecounters are >> infrequent enough to not race often with the userland accesses. > > Yes, the userland code is more correct here. The barriers are indeed missing in > the kernel part, and adding them should give something equivalent to a correctly > working seq API as it is doing the same thing. Userland is technically correct, but this defeats the point of the intended algorithm. I now remember a bit more about the algorithm. There are several generations of timehands. Each generation remains stable for several clock ticks. That should be several clock ticks at 100 Hz. Normally there is no problem with just using the old pointer read from timehands (except there is no serialization for updating timehands itself (*)). However, the thread might be preempted for several clock ticks. This is enough time for the old generation to change. The generation count is used to detect such changes. Again it doesn't matter if the generation count is out of date, unless it is out of date by a few generations. 
So the algorithm works unless the CPU de-serializes things by more than a few clock ticks. I think no real CPUs do that. Virtual CPUs can do that, but I think they aren't a problem in practice. Single stepping in ddb gives a sort of virtual CPU and breaks the algorithm since time runs much faster outside of the stepped process and may do several generations per step. The generation count protects against using a changed timehands but may cause binuptime() to never terminate instead. It takes much weirder virtualization than that to break the generation count itself. Any normal preemption or abnormal stopping of CPUs uses locks galore which synchronize everything on at least x86. Variable-tick kernels give another problem. They sometimes issue virtual clock interrupts to catch up. I think they take some care with tc_windup() but perhaps not enough. tc_windup() calls must be separated so that the timehands don't cycle too fast or too slow in either real time or time related to other system operation (there are hard real time requirements mainly for reading real hardware timecounters before they overflow). (*): % binuptime(struct bintime *bt) % { % struct timehands *th; % u_int gen; % % do { % th = timehands; Since tc_windup() also doesn't dream of memory ordering, timehands here may be in the future of what it points to. That is much worse than it being in the past. Barriers would be cheap in tc_windup() but useless if they require barriers in binuptime() to work. tc_windup() is normally called from the clock interrupt handler. There are several mutexes (or at least atomic ops that give synchronization on at least x86 SMP) before and after it. These give serialization very soon after the changes. The fix (without adding any barrier instructions) is easy. Simply run the timehands update 1 or 2 generations behind the update of what it points to. 
This gives even more than time-domain locking, since the accidental synchronization from the interrupt handler gives ordering between the update of the pointed-to data and the timehands pointer. % gen = th->th_generation; It doesn't matter if the generation count is in the future, but it needs to be the same as what was written in the past or future. % *bt = th->th_offset; % bintime_addx(bt, th->th_scale * tc_delta(th)); % } while (gen == 0 || gen != th->th_generation); % } Now the timehands update code: % /* % * Now that the struct timehands is again consistent, set the new % * generation number, making sure to not make it zero. % */ It is only sure to be consistent on in-order CPUs. % if (++ogen == 0) % ogen = 1; % th->th_generation = ogen; % % /* Go live with the new struct timehands. */ % #ifdef FFCLOCK % switch (sysclock_active) { % case SYSCLOCK_FBCK: % #endif I don't like the FFCLOCK complications. They interact with the locking bugs a little here. % time_second = th->th_microtime.tv_sec; % time_uptime = th->th_offset.sec; Old versions had only these 2 statements before setting timehands and returning. These are racy enough. Using these variables is racier. They have type time_t, so they might be 64 bits on 32-bit arches so reading them might be non-atomic. In practice, very strong time-domain locking applies -- the races won't occur until the top bits start being actually used a mere 24 years from now. Then there will be a race window of a few microseconds. The generation count should be used to make accesses to these variables technically correct and slow. % #ifdef FFCLOCK % break; % case SYSCLOCK_FFWD: % time_second = fftimehands->tick_time_lerp.sec; % time_uptime = fftimehands->tick_time_lerp.sec - ffclock_boottime.sec; % break; Perhaps more races from more complicated expressions. Also a style bug (long line). % } % #endif % % timehands = th; % timekeep_push_vdso(); % } timekeep_push_vdso() has a couple of atomic stores in it. 
Perhaps these give perfect serialization for the user variables. On some arches, they accidentally sync the kernel variables a little earlier than the accidental sync from the interrupt handler. Still out of order with the kernel variable updates. Again, this shouldn't be needed -- use a delayed pointer update for the user variables too. Bruce From owner-freebsd-arch@FreeBSD.ORG Thu Aug 21 03:34:58 2014 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id DE494AB2; Thu, 21 Aug 2014 03:34:58 +0000 (UTC) Received: from mail108.syd.optusnet.com.au (mail108.syd.optusnet.com.au [211.29.132.59]) by mx1.freebsd.org (Postfix) with ESMTP id 88D463C6F; Thu, 21 Aug 2014 03:34:58 +0000 (UTC) Received: from c122-106-147-133.carlnfd1.nsw.optusnet.com.au (c122-106-147-133.carlnfd1.nsw.optusnet.com.au [122.106.147.133]) by mail108.syd.optusnet.com.au (Postfix) with ESMTPS id AFCF61A2C2E; Thu, 21 Aug 2014 13:34:48 +1000 (EST) Date: Thu, 21 Aug 2014 13:34:47 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Bruce Evans Subject: Re: [PATCH 0/2] plug capability races In-Reply-To: <20140821044234.H11472@besplex.bde.org> Message-ID: <20140821113753.D933@besplex.bde.org> References: <1408064112-573-1-git-send-email-mjguzik@gmail.com> <201408151031.45967.jhb@freebsd.org> <20140816102840.V1007@besplex.bde.org> <201408201111.47601.jhb@freebsd.org> <20140821044234.H11472@besplex.bde.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.1 cv=AOuw8Gd4 c=1 sm=1 tr=0 a=7NqvjVvQucbO2RlWB8PEog==:117 a=PO7r1zJSAAAA:8 a=tTSYktBZc9AA:10 a=KN91Z2BipYgA:10 a=kj9zAlcOel0A:10 a=JzwRw_2MAAAA:8 a=rnMR6aIR_FJUPtQO_FsA:9 a=W3cemsHr8jZuBReB:21 a=LqShDMf_JJxsx4l9:21 a=CjuIK1q_8ugA:10 Cc: Mateusz Guzik , Robert Watson , Johan 
Schuijt , freebsd-arch@freebsd.org, Konstantin Belousov X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.18-1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 21 Aug 2014 03:34:59 -0000 On Thu, 21 Aug 2014, Bruce Evans wrote: > ... > I now remember a bit more about the algorithm. There are several > generations of timehands. Each generation remains stable for several > clock ticks. That should be several clock ticks at 100 Hz. Normally > there is no problem with just using the old pointer read from timehands > (except there is no serialization for updating timehands itself (*)). > ... > (*): > > % binuptime(struct bintime *bt) > % { > % struct timehands *th; > % u_int gen; > % % do { > % th = timehands; > > Since tc_windup() also doesn't dream of memory ordering, timehands here > may be in the future of what it points to. That is much worse than it > being in the past. Barriers would be cheap in tc_windup() but useless > if they require barriers in binuptime() to work. > > tc_windup() is normally called from the clock interrupt handler. There > are several mutexes (or at least atomic ops that give synchronization on > at least x86 SMP) before and after it. These give serialization very > soon after the changes. > > The fix (without adding any barrier instructions) is easy. Simply > run the timehands update 1 or 2 generations behind the update of what > it points to. This gives even more than time-domain locking, since > the accidental synchronization from the interrupt handler gives ordering > between the update of the pointed-to data and the timehands pointer. > ... More details: - lock tc_windup() and tc_ticktock() using a spinlock - add hard real-time rate limiting and error recovery so that the timehands are not cycled through too fast or too slow. 
tc_ticktock() already does this for calls from the clock interrupt handler except when clock interrupts are non-hard. tc_ticktock() can use mtx_trylock() and do nothing if the mutex is contested. tc_setclock() and possibly inittimecounter() should wait to synchronize with the next clock interrupt that would call tc_windup(), and advance the time that they set by the wait delay plus previous delays, and even more, since its changes shouldn't go live for several generations. It sort of does this now, in a broken way. It corrupts the boot time using racy accesses. This limits problems from large adjustments to realtime clock ids (the ones that add the boot time). There are no further delays, just races accessing the boot time in critical places like boottime(). Delays are now also limited by calling tc_windup() and tc_windup() going live with updated timehands almost immediately (as soon as it completes). The immediate tc_windup() call is commented on as being to fiddle with all the crinkly bits around the fiords, but the only critical thing it does is update the generation count in a fairly non-racy way -- this tells bintime() to loop, so it has a chance of picking up the changed boot time with a coherent value. sysctl_kern_timecounter_hardware() should call tc_windup() to do a staged update much like for tc_setclock(). It refrains from doing this because of the races, but it hacks on the timehands pointer in a different and even more fragile racy way. It now calls timekeep_push_vdso() to do the userland part of tc_windup(). The timehands may be recycled too slowly. This happens mainly on suspension. The system depends on frequent windups to work, so it can't run really tickless. After suspension, all old generations are garbage but their generation counts might not have been updated to indicate this. The system should at least try to detect this. I don't understand what happens for timecounters on resume now. 
- in tc_windup(), bump the generation count for the second-oldest generation instead of setting it to 0 for the current generation, and update the timehands for the oldest generation instead of changing them for the current generation. This also fixes busy-waiting and contention on the timehands for the current generation during the windup. Using the special generation count of 0 essentially reduces the "queue" of timehands from length 10 to length 0 during the windup, at a cost of complications and bugs. It also makes the other 9 generations of the timehands almost never used, and not very useful. 1 generation together with a generation count that is set to 0 during windups suffices, at the cost of spinning while the generation count is 0 and complications and bugs in accesses to the generation count. But the current version already has all these costs in the usual case where the generation changes. tc_windup() is supposed to run with interrupts disabled, so that it cannot be preempted and the length of the spinning is bounded. (Having only Giant locking for the call in settime() is even worse than first appeared. It doesn't prevent preemption at all, so the length of the spinning is unbounded.) In unusual cases, binuptime() is preempted and the generation count changes many times before the original timehands is used. Then the pointer to it is invalid. But the generation count in it has increased by more than usual, so the change is detected and the pointer is updated. So old generations are not used for storing anything important except for the generation count, and having 10 generations just reduces the rate of increase of generation counts by a factor of 10, so it takes preemption by 10 * 2^32 windups instead of only 2^32 for the algorithm to be broken by wraparound of the generation count (with HZ = 1000, that is 490 days of preemption instead of only 49). The delayed updates might cause different complications. 
I think ntp seconds updates strictly should be done in advance so as to go live on seconds rollover. The details can't be too critical, since with HZ = 100 tc_windup calls are out of sync with seconds rollovers by an average of 5 milliseconds (+-5) and no one seemed to notice problems from that. Isn't there an error of 1 second for the duration of the sync time around leap seconds adjustments? With HZ = 1000 the update "queue" with intentionally delayed updates could have length 5 and give much the same behaviour except for missing races (the average delay would still be 5 milliseconds but now +-0.5). Bruce
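The wraparound figures quoted in the message above (49 and 490 days at HZ = 1000) can be checked with a short calculation. This is only a back-of-the-envelope sketch: wrap_days() is a hypothetical helper, and it assumes one windup per clock tick, a 32-bit generation count, and that each timehands slot has its count bumped once every nhands windups.

```c
#include <stdint.h>

/*
 * Days of continuous preemption needed before a 32-bit generation
 * count in one timehands slot wraps all the way around.
 * hz:     windups per second (assumed equal to the clock tick rate)
 * nhands: number of timehands slots sharing the windups round-robin
 */
static double
wrap_days(unsigned hz, unsigned nhands)
{
	const double windups = 4294967296.0 * nhands;	/* 2^32 per slot */

	return (windups / hz / 86400.0);
}
```

With these assumptions, wrap_days(1000, 1) is a little under 50 days and wrap_days(1000, 10) a little under 500, in line with the rounded numbers given in the message.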