From owner-svn-src-head@freebsd.org  Fri Jun  8 06:09:00 2018
Return-Path: <owner-svn-src-head@freebsd.org>
Delivered-To: svn-src-head@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id 49C021011A0B;
 Fri,  8 Jun 2018 06:09:00 +0000 (UTC)
 (envelope-from rlibby@gmail.com)
Received: from mail-pl0-f66.google.com (mail-pl0-f66.google.com
 [209.85.160.66])
 (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
 (Client CN "smtp.gmail.com",
 Issuer "Google Internet Authority G2" (verified OK))
 by mx1.freebsd.org (Postfix) with ESMTPS id AFBB774E82;
 Fri,  8 Jun 2018 06:08:59 +0000 (UTC)
 (envelope-from rlibby@gmail.com)
Received: by mail-pl0-f66.google.com with SMTP id z9-v6so7641074plk.11;
 Thu, 07 Jun 2018 23:08:59 -0700 (PDT)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20161025;
 h=x-gm-message-state:mime-version:in-reply-to:references:from:date
 :message-id:subject:to:cc;
 bh=j66XXLro73VuAPWcjhFEcJXrR6lRTT7xNLnWKS+TY3E=;
 b=DZFIsgxONNS4ZCV5C0W4XuXt5K6el7ez/AIuj7gkqaSisjegOwkQTOFnMgAoGXApnG
 diypHl7jvVsjvdGKKIjpcJZMkqY9a9mWpE+L5ZXKRjywwEuQlt0Yn2NZkGHSSZSeg8+t
 qyirYCPNP3bAoSBSsBXf/9XQJhzoISzV3rZ10+C52PUJ+PmEbgJkI88WU/nM8TZVlZcY
 PL+bhBYdlHV7AgGbn2ttosCChmQ8EaJHPFfdMP9vgJ/df8cTUL07UKUHSqHXxl37lf2x
 Sdi/34SmWuNsehgPUtUpPWDFtVU3re3XwalSTZHbbK5SwIyGwdB2bRPCw3BeB8L5AFTj
 umcQ==
X-Gm-Message-State: APt69E3RPclDuV+5prSMl7vNHsdA7bPm9yKcyDmjKenbN7yupXiSdxx0
 yV1XVeBz9x2eP5OwjId7X6cBkhs3qlI=
X-Google-Smtp-Source: ADUXVKJfMzQ+1XE7KqxkDfQ0xyo3z/NSbvgOVd+49AHJyLFMXeCpmBAbtJtNN1U95LImFzqJJeMEtA==
X-Received: by 2002:a17:902:6ac3:: with SMTP id
 i3-v6mr5076783plt.378.1528437750036; 
 Thu, 07 Jun 2018 23:02:30 -0700 (PDT)
Received: from mail-pf0-f182.google.com (mail-pf0-f182.google.com.
 [209.85.192.182])
 by smtp.gmail.com with ESMTPSA id z28-v6sm62826861pfl.169.2018.06.07.23.02.29
 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128);
 Thu, 07 Jun 2018 23:02:29 -0700 (PDT)
Received: by mail-pf0-f182.google.com with SMTP id b17-v6so6088297pfi.0;
 Thu, 07 Jun 2018 23:02:29 -0700 (PDT)
X-Received: by 2002:a62:303:: with SMTP id 3-v6mr4587743pfd.255.1528437749698; 
 Thu, 07 Jun 2018 23:02:29 -0700 (PDT)
MIME-Version: 1.0
Received: by 2002:a17:90a:1581:0:0:0:0 with HTTP; Thu, 7 Jun 2018 23:02:29
 -0700 (PDT)
In-Reply-To: <CAGudoHENGpCxn_omxfaRLOAH5fP6qdFcmmqZ7He+pcC=-1HFKQ@mail.gmail.com>
References: <201806061257.w56CvCwq089369@repo.freebsd.org>
 <20180606140311.GU2450@kib.kiev.ua> <20180608033242.GA54099@pesky>
 <CAHgpiFyOQf6B3oGFGMz3avXqrrP0i6Puy9HqLER1XG5xE67BeQ@mail.gmail.com>
 <CAGudoHENGpCxn_omxfaRLOAH5fP6qdFcmmqZ7He+pcC=-1HFKQ@mail.gmail.com>
From: Ryan Libby <rlibby@freebsd.org>
Date: Thu, 7 Jun 2018 23:02:29 -0700
X-Gmail-Original-Message-ID: <CAHgpiFzMQjHfnQLQrWc86FSxB2veHZeAc44qmROkaJugpGoU=g@mail.gmail.com>
Message-ID: <CAHgpiFzMQjHfnQLQrWc86FSxB2veHZeAc44qmROkaJugpGoU=g@mail.gmail.com>
Subject: Re: svn commit: r334708 - head/sys/kern
To: Mateusz Guzik <mjguzik@gmail.com>
Cc: Mark Johnston <markj@freebsd.org>,
 Konstantin Belousov <kostikbel@gmail.com>, 
 Justin Hibbits <jhibbits@freebsd.org>,
 src-committers <src-committers@freebsd.org>, 
 svn-src-all@freebsd.org, svn-src-head@freebsd.org
Content-Type: text/plain; charset="UTF-8"
X-BeenThere: svn-src-head@freebsd.org
X-Mailman-Version: 2.1.26
Precedence: list
List-Id: SVN commit messages for the src tree for head/-current
 <svn-src-head.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/svn-src-head>,
 <mailto:svn-src-head-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/svn-src-head/>
List-Post: <mailto:svn-src-head@freebsd.org>
List-Help: <mailto:svn-src-head-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/svn-src-head>,
 <mailto:svn-src-head-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 08 Jun 2018 06:09:00 -0000

On Thu, Jun 7, 2018 at 10:03 PM, Mateusz Guzik <mjguzik@gmail.com> wrote:
> On Fri, Jun 8, 2018 at 6:29 AM, Ryan Libby <rlibby@freebsd.org> wrote:
>>
>> On Thu, Jun 7, 2018 at 8:32 PM, Mark Johnston <markj@freebsd.org> wrote:
>> > On Wed, Jun 06, 2018 at 05:03:11PM +0300, Konstantin Belousov wrote:
>> >> On Wed, Jun 06, 2018 at 12:57:12PM +0000, Justin Hibbits wrote:
>> >> > Author: jhibbits
>> >> > Date: Wed Jun  6 12:57:11 2018
>> >> > New Revision: 334708
>> >> > URL: https://svnweb.freebsd.org/changeset/base/334708
>> >> >
>> >> > Log:
>> >> >   Add a memory barrier after taking a reference on the vnode holdcnt
>> >> > in _vhold
>> >> >
>> >> >   This is needed to avoid a race between the VNASSERT() below, and
>> >> > another
>> >> >   thread updating the VI_FREE flag, on weakly-ordered architectures.
>> >> >
>> >> >   On a 72-thread POWER9, without this barrier a 'make -j72
>> >> > buildworld' would
>> >> >   panic on the assert regularly.
>> >> >
>> >> >   It may be possible to use a weaker barrier, and I'll investigate
>> >> > that once
>> >> >   all stability issues are worked out on POWER9.
>> >> >
>> >> > Modified:
>> >> >   head/sys/kern/vfs_subr.c
>> >> >
>> >> > Modified: head/sys/kern/vfs_subr.c
>> >> >
>> >> > ==============================================================================
>> >> > --- head/sys/kern/vfs_subr.c        Wed Jun  6 10:46:24 2018
>> >> > (r334707)
>> >> > +++ head/sys/kern/vfs_subr.c        Wed Jun  6 12:57:11 2018
>> >> > (r334708)
>> >> > @@ -2807,6 +2807,9 @@ _vhold(struct vnode *vp, bool locked)
>> >> >     CTR2(KTR_VFS, "%s: vp %p", __func__, vp);
>> >> >     if (!locked) {
>> >> >             if (refcount_acquire_if_not_zero(&vp->v_holdcnt)) {
>> >> > +#if !defined(__amd64__) && !defined(__i386__)
>> >> > +                   mb();
>> >> > +#endif
>> >> >                     VNASSERT((vp->v_iflag & VI_FREE) == 0, vp,
>> >> >                         ("_vhold: vnode with holdcnt is free"));
>> >> >                     return;
>> >> First, mb() must not be used in the FreeBSD code at all.
>> >> Look at atomic_thread_fenceXXX(9) KPI.
>> >>
>> >> Second, you need the reciprocal fence between clearing of VI_FREE and
>> >> refcount_acquire(), otherwise the added barrier is nop.  Most likely,
>> >> you got away with it because there is a mutex unlock between clearing
>> >> of VI_FREE and acquire, which release semantic you abused.
>> >
>> > I note that vnlru_free_locked() clears VI_FREE and increments v_holdcnt
>> > without an intervening release fence. At this point the caller has not
>> > purged the vnode from the name cache, so it seems possible that the
>> > panicking thread observed the two stores out of order. In particular, it
>> > seems to me that the patch below is necessary, but possibly (probably?)
>> > not sufficient:
>>
>> Mark, Justin, and I looked at this.
>>
>> I think that the VNASSERT itself is not quite valid, since it is
>> checking the value of v_iflag without the vnode interlock held (and
>> without the vop lock also).  If the rule is that we normally need the
>> vnode interlock to check VI_FREE, then I don't think that possible
>> reordering between the refcount_acquire and VI_FREE clearing in
>> vnlru_free_locked is necessarily a problem in production.
>
>
> Checking it without any locks is perfectly valid in this case. It is done
> after v_holdcnt gets bumped from a non-zero value. So at that time it
> is at least two. Of course that result is stale as an arbitrary number of
> other threads could have bumped and dropped the ref past that point.
> The minimum value is 1 since we hold the ref. But this means the
> vnode must not be on the free list and that's what the assertion is
> verifying.
>
> The problem is indeed lack of ordering against the code clearing the
> flag for the case where 2 threads to vhold and one does the 0->1
> transition.
>
> That said, the fence is required for the assertion to work.
>

Yeah, I agree with this logic.  What I mean is that reordering between
v_holdcnt 0->1 and v_iflag is normally settled by the release and
acquisition of the vnode interlock, which we are supposed to hold for
v_*i*flag.  A quick scan seems to show all of the checks of VI_FREE that
are not asserts do hold the vnode interlock.  So, I'm just saying that I
don't think the possible reordering affects them.

>>
>>
>> It might just be that unlocked assertions about v_iflag actually need
>> additional synchronization (although it would be a little unfortunate to
>> add synchronization only to INVARIANTS builds).
>>
>> >
>> >> Does the fence needed for the non-invariants case ?
>> >>
>> >> Fourth, doesn't v_usecount has the same issues WRT inactivation ?
>> >
>> > diff --git a/sys/kern/vfs_subr.c b/sys/kern/vfs_subr.c
>> > index 286a871c3631..c97a8ba63612 100644
>> > --- a/sys/kern/vfs_subr.c
>> > +++ b/sys/kern/vfs_subr.c
>> > @@ -1018,6 +1018,7 @@ vnlru_free_locked(int count, struct vfsops
>> > *mnt_op)
>> >                  */
>> >                 freevnodes--;
>> >                 vp->v_iflag &= ~VI_FREE;
>> > +               atomic_thread_fence_rel();
>
>
> This probably has no added value for non-debug kernels.
>
>>
>> >                 refcount_acquire(&vp->v_holdcnt);
>> >
>> >                 mtx_unlock(&vnode_free_list_mtx);
>> > @@ -2807,9 +2808,7 @@ _vhold(struct vnode *vp, bool locked)
>> >         CTR2(KTR_VFS, "%s: vp %p", __func__, vp);
>> >         if (!locked) {
>> >                 if (refcount_acquire_if_not_zero(&vp->v_holdcnt)) {
>> > -#if !defined(__amd64__) && !defined(__i386__)
>> > -                       mb();
>> > -#endif
>> > +                       atomic_thread_fence_acq();
>
>
> Same as above.
>
>>
>> >                         VNASSERT((vp->v_iflag & VI_FREE) == 0, vp,
>> >                             ("_vhold: vnode with holdcnt is free"));
>> >                         return;
>
>
> The main codepath which runs into this (... -> cache_lookup -> vhold) most
> definitely does not need the fence for production operation.
>
> What is unclear without audit is whether there are vhold users which need
> one. I think the patch is fine to go in if the other VI_FREE place gets a
> comment about a fence stemming from mtx_unlock and there is another one
> prior to the assertion explaining that this orders against v_iflag tests and
> may
> or may not be needed for other consumers.
>
> In general the code is *full* of data races and accidental reliance of
> ordering
> provided by amd64. Weeding this all out will be a painful exercise.
>
> Part of the problem is lack of primitives like READ_ONCE/WRITE_ONCE as
> seen in the linux kernel, someone should hack up equivalents.
>
> --
> Mateusz Guzik <mjguzik gmail.com>

Yeah, I think we should understand whether the problem is just the
unsynchronized assertions or whether there are some true bugs.  It
doesn't seem great to me to add synchronization which as you said

> This probably has no added value for non-debug kernels.