Date:      Fri, 8 Jun 2018 20:37:55 +0300
From:      Konstantin Belousov <kostikbel@gmail.com>
To:        Ryan Libby <rlibby@freebsd.org>
Cc:        Mateusz Guzik <mjguzik@gmail.com>, Mark Johnston <markj@freebsd.org>, Justin Hibbits <jhibbits@freebsd.org>, src-committers <src-committers@freebsd.org>, svn-src-all@freebsd.org, svn-src-head@freebsd.org
Subject:   Re: svn commit: r334708 - head/sys/kern
Message-ID:  <20180608173755.GJ2450@kib.kiev.ua>
In-Reply-To: <CAHgpiFzMQjHfnQLQrWc86FSxB2veHZeAc44qmROkaJugpGoU=g@mail.gmail.com>
References:  <201806061257.w56CvCwq089369@repo.freebsd.org> <20180606140311.GU2450@kib.kiev.ua> <20180608033242.GA54099@pesky> <CAHgpiFyOQf6B3oGFGMz3avXqrrP0i6Puy9HqLER1XG5xE67BeQ@mail.gmail.com> <CAGudoHENGpCxn_omxfaRLOAH5fP6qdFcmmqZ7He%2BpcC=-1HFKQ@mail.gmail.com> <CAHgpiFzMQjHfnQLQrWc86FSxB2veHZeAc44qmROkaJugpGoU=g@mail.gmail.com>

On Thu, Jun 07, 2018 at 11:02:29PM -0700, Ryan Libby wrote:
> On Thu, Jun 7, 2018 at 10:03 PM, Mateusz Guzik <mjguzik@gmail.com> wrote:
> > On Fri, Jun 8, 2018 at 6:29 AM, Ryan Libby <rlibby@freebsd.org> wrote:
> >>
> >> On Thu, Jun 7, 2018 at 8:32 PM, Mark Johnston <markj@freebsd.org> wrote:
> >> > On Wed, Jun 06, 2018 at 05:03:11PM +0300, Konstantin Belousov wrote:
> >> >> On Wed, Jun 06, 2018 at 12:57:12PM +0000, Justin Hibbits wrote:
> >> >> > Author: jhibbits
> >> >> > Date: Wed Jun  6 12:57:11 2018
> >> >> > New Revision: 334708
> >> >> > URL: https://svnweb.freebsd.org/changeset/base/334708
> >> >> >
> >> >> > Log:
> >> >> >   Add a memory barrier after taking a reference on the vnode
> >> >> >   holdcnt in _vhold
> >> >> >
> >> >> >   This is needed to avoid a race between the VNASSERT() below, and
> >> >> >   another thread updating the VI_FREE flag, on weakly-ordered
> >> >> >   architectures.
> >> >> >
> >> >> >   On a 72-thread POWER9, without this barrier a 'make -j72
> >> >> >   buildworld' would panic on the assert regularly.
> >> >> >
> >> >> >   It may be possible to use a weaker barrier, and I'll investigate
> >> >> >   that once all stability issues are worked out on POWER9.
> >> >> >
> >> >> > Modified:
> >> >> >   head/sys/kern/vfs_subr.c
> >> >> >
> >> >> > Modified: head/sys/kern/vfs_subr.c
> >> >> >
> >> >> > ==============================================================================
> >> >> > --- head/sys/kern/vfs_subr.c        Wed Jun  6 10:46:24 2018        (r334707)
> >> >> > +++ head/sys/kern/vfs_subr.c        Wed Jun  6 12:57:11 2018        (r334708)
> >> >> > @@ -2807,6 +2807,9 @@ _vhold(struct vnode *vp, bool locked)
> >> >> >     CTR2(KTR_VFS, "%s: vp %p", __func__, vp);
> >> >> >     if (!locked) {
> >> >> >             if (refcount_acquire_if_not_zero(&vp->v_holdcnt)) {
> >> >> > +#if !defined(__amd64__) && !defined(__i386__)
> >> >> > +                   mb();
> >> >> > +#endif
> >> >> >                     VNASSERT((vp->v_iflag & VI_FREE) == 0, vp,
> >> >> >                         ("_vhold: vnode with holdcnt is free"));
> >> >> >                     return;
> >> >> First, mb() must not be used in FreeBSD code at all.  Look at the
> >> >> atomic_thread_fenceXXX(9) KPI.
> >> >>
> >> >> Second, you need the reciprocal fence between the clearing of VI_FREE
> >> >> and refcount_acquire(), otherwise the added barrier is a nop.  Most
> >> >> likely you got away with it because there is a mutex unlock between
> >> >> the clearing of VI_FREE and the acquire, whose release semantics you
> >> >> implicitly relied on.
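
To make the pairing concrete, here is a minimal userland sketch of the
ordering being discussed, written with C11 atomics rather than the kernel's
atomic_thread_fence_acq()/atomic_thread_fence_rel() KPI.  The "toy" names are
invented for illustration and the functions only model the
vnlru_free_locked()/_vhold() interaction; this is not the kernel code.

#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define	TOY_VI_FREE	0x0001

struct toy_vnode {
	atomic_int	v_holdcnt;
	atomic_int	v_iflag;
};

/*
 * Producer, modelling vnlru_free_locked(): clear the free flag, then
 * publish the new hold count.  The release fence keeps the flag store
 * from becoming visible after the count store on weakly-ordered CPUs.
 */
static void
toy_remove_from_freelist(struct toy_vnode *vp)
{
	atomic_fetch_and_explicit(&vp->v_iflag, ~TOY_VI_FREE,
	    memory_order_relaxed);
	atomic_thread_fence(memory_order_release);	/* pairs with acquire below */
	atomic_fetch_add_explicit(&vp->v_holdcnt, 1, memory_order_relaxed);
}

/*
 * Consumer, modelling the unlocked fast path of _vhold(): bump the count
 * only if it was already non-zero, then check the flag, but only after an
 * acquire fence that pairs with the release fence above.
 */
static bool
toy_hold_if_not_zero(struct toy_vnode *vp)
{
	int old;

	old = atomic_load_explicit(&vp->v_holdcnt, memory_order_relaxed);
	do {
		if (old == 0)
			return (false);
	} while (!atomic_compare_exchange_weak_explicit(&vp->v_holdcnt,
	    &old, old + 1, memory_order_relaxed, memory_order_relaxed));
	atomic_thread_fence(memory_order_acquire);
	assert((atomic_load_explicit(&vp->v_iflag, memory_order_relaxed) &
	    TOY_VI_FREE) == 0);
	return (true);
}

Without both fences, the assert can legitimately fire on POWER or ARM even
though the non-zero hold count was observed, which is the race the commit
message describes.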
> >> >
> >> > I note that vnlru_free_locked() clears VI_FREE and increments v_holdcnt
> >> > without an intervening release fence. At this point the caller has not
> >> > purged the vnode from the name cache, so it seems possible that the
> >> > panicking thread observed the two stores out of order. In particular, it
> >> > seems to me that the patch below is necessary, but possibly (probably?)
> >> > not sufficient:
> >>
> >> Mark, Justin, and I looked at this.
> >>
> >> I think that the VNASSERT itself is not quite valid, since it is
> >> checking the value of v_iflag without the vnode interlock held (and
> >> without the vop lock also).  If the rule is that we normally need the
> >> vnode interlock to check VI_FREE, then I don't think that possible
> >> reordering between the refcount_acquire and VI_FREE clearing in
> >> vnlru_free_locked is necessarily a problem in production.
> >
> >
> > Checking it without any locks is perfectly valid in this case. It is done
> > after v_holdcnt gets bumped from a non-zero value. So at that time it
> > is at least two. Of course that result is stale as an arbitrary number of
> > other threads could have bumped and dropped the ref past that point.
> > The minimum value is 1 since we hold the ref. But this means the
> > vnode must not be on the free list and that's what the assertion is
> > verifying.
> >
> > The problem is indeed the lack of ordering against the code clearing the
> > flag for the case where two threads do vhold and one of them does the
> > 0->1 transition.
> >
> > That said, the fence is required for the assertion to work.
> >
> 
> Yeah, I agree with this logic.  What I mean is that reordering between
> v_holdcnt 0->1 and v_iflag is normally settled by the release and
> acquisition of the vnode interlock, which we are supposed to hold for
> v_*i*flag.  A quick scan seems to show all of the checks of VI_FREE that
> are not asserts do hold the vnode interlock.  So, I'm just saying that I
> don't think the possible reordering affects them.
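
As an aside on that point (an illustration, not code from the tree): the
interlock settles the ordering because a mutex unlock has release semantics
and a subsequent lock has acquire semantics, so whoever takes the interlock
after the free-list manipulation is guaranteed to see both stores.  A
stripped-down userland model with pthreads:

#include <assert.h>
#include <pthread.h>

/* Hypothetical shared state standing in for v_iflag/v_holdcnt. */
static pthread_mutex_t interlock = PTHREAD_MUTEX_INITIALIZER;
static int vi_free = 1;
static int holdcnt = 0;

/* Thread taking the vnode off the free list, under the interlock. */
static void
producer_locked(void)
{
	pthread_mutex_lock(&interlock);
	vi_free = 0;
	holdcnt++;
	pthread_mutex_unlock(&interlock);	/* release: publishes both stores */
}

/* Thread checking the flag: acquiring the interlock sees those stores. */
static void
consumer_locked(void)
{
	pthread_mutex_lock(&interlock);		/* acquire: pairs with unlock above */
	if (holdcnt > 0)
		assert(vi_free == 0);
	pthread_mutex_unlock(&interlock);
}

The unlocked assertion in _vhold() is exactly the case that forgoes this
lock-provided ordering, which is why it would need an explicit fence pair
instead.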
But do we know that only the VI_FREE checks are affected?

My concern is that users of _vhold() rely on seeing the up-to-date state of
the vnode, and VI_FREE is only one example of the problem.  Most likely, the
code which fetched the vnode pointer before the _vhold() call should
guarantee visibility.
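
One generic way such a guarantee can be provided (an assumed pattern, not a
claim about what the VFS or namecache code actually does) is pointer
publication with release/acquire: the publisher finishes initializing the
object before release-storing the pointer, and the consumer acquire-loads the
pointer before dereferencing it, so everything written before the publication
is visible by the time _vhold() is called.

#include <stdatomic.h>
#include <stddef.h>

struct toy_vnode {
	int	v_iflag;
	int	v_holdcnt;
};

/* Shared slot standing in for wherever the vnode pointer is found. */
static _Atomic(struct toy_vnode *) published;

/* Publisher: initialize the object, then release-store the pointer. */
static void
toy_publish(struct toy_vnode *vp)
{
	vp->v_iflag = 0;
	vp->v_holdcnt = 1;
	atomic_store_explicit(&published, vp, memory_order_release);
}

/* Consumer: acquire-load the pointer; the fields written above are visible. */
static struct toy_vnode *
toy_lookup(void)
{
	return (atomic_load_explicit(&published, memory_order_acquire));
}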

> 
> >>
> >>
> >> It might just be that unlocked assertions about v_iflag actually need
> >> additional synchronization (although it would be a little unfortunate to
> >> add synchronization only to INVARIANTS builds).
> >>
> >> >
> >> >> Is the fence needed for the non-invariants case?
> >> >>
> >> >> Fourth, doesn't v_usecount have the same issues WRT inactivation?
> >> >
> >> > diff --git a/sys/kern/vfs_subr.c b/sys/kern/vfs_subr.c
> >> > index 286a871c3631..c97a8ba63612 100644
> >> > --- a/sys/kern/vfs_subr.c
> >> > +++ b/sys/kern/vfs_subr.c
> >> > @@ -1018,6 +1018,7 @@ vnlru_free_locked(int count, struct vfsops *mnt_op)
> >> >                  */
> >> >                 freevnodes--;
> >> >                 vp->v_iflag &= ~VI_FREE;
> >> > +               atomic_thread_fence_rel();
> >
> >
> > This probably has no added value for non-debug kernels.
> >
> >>
> >> >                 refcount_acquire(&vp->v_holdcnt);
> >> >
> >> >                 mtx_unlock(&vnode_free_list_mtx);
> >> > @@ -2807,9 +2808,7 @@ _vhold(struct vnode *vp, bool locked)
> >> >         CTR2(KTR_VFS, "%s: vp %p", __func__, vp);
> >> >         if (!locked) {
> >> >                 if (refcount_acquire_if_not_zero(&vp->v_holdcnt)) {
> >> > -#if !defined(__amd64__) && !defined(__i386__)
> >> > -                       mb();
> >> > -#endif
> >> > +                       atomic_thread_fence_acq();
> >
> >
> > Same as above.
> >
> >>
> >> >                         VNASSERT((vp->v_iflag & VI_FREE) == 0, vp,
> >> >                             ("_vhold: vnode with holdcnt is free"));
> >> >                         return;
> >
> >
> > The main codepath which runs into this (... -> cache_lookup -> vhold) most
> > definitely does not need the fence for production operation.
> >
> > What is unclear without an audit is whether there are vhold users which
> > need one. I think the patch is fine to go in if the other VI_FREE place
> > gets a comment about the fence stemming from mtx_unlock, and there is
> > another one prior to the assertion explaining that it orders against
> > v_iflag tests and may or may not be needed for other consumers.
> >
> > In general the code is *full* of data races and accidental reliance on
> > the ordering provided by amd64. Weeding this all out will be a painful
> > exercise.
> >
> > Part of the problem is the lack of primitives like READ_ONCE/WRITE_ONCE
> > as seen in the Linux kernel; someone should hack up equivalents.
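
For reference, the Linux primitives being mentioned boil down to forcing a
single access through a volatile-qualified lvalue.  A hypothetical
FreeBSD-side equivalent might look like the sketch below; these macros do not
exist in the tree, and they constrain only the compiler (no CPU ordering is
implied), so fences or acquire/release atomics are still needed where
inter-CPU ordering matters.

/*
 * Hypothetical READ_ONCE()/WRITE_ONCE() equivalents: prevent the compiler
 * from tearing, fusing, or re-reading the access.  Names are invented.
 */
#define	ONCE_READ(x)		(*(volatile __typeof(x) *)&(x))
#define	ONCE_WRITE(x, v)	(*(volatile __typeof(x) *)&(x) = (v))

/* Usage sketch with a made-up flags word. */
static int toy_flags;

static int
toy_flags_snapshot(void)
{
	return (ONCE_READ(toy_flags));
}

static void
toy_flags_set(int val)
{
	ONCE_WRITE(toy_flags, val);
}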
> >
> > --
> > Mateusz Guzik <mjguzik gmail.com>
> 
> Yeah, I think we should understand whether the problem is just the
> unsynchronized assertions or whether there are some true bugs.  It
> doesn't seem great to me to add synchronization which, as you said,
> 
> > This probably has no added value for non-debug kernels.


