From owner-freebsd-fs@FreeBSD.ORG Sun Sep 25 22:23:52 2011 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 570BC106566C; Sun, 25 Sep 2011 22:23:52 +0000 (UTC) (envelope-from to.my.trociny@gmail.com) Received: from mail-bw0-f54.google.com (mail-bw0-f54.google.com [209.85.214.54]) by mx1.freebsd.org (Postfix) with ESMTP id 580A58FC0C; Sun, 25 Sep 2011 22:23:50 +0000 (UTC) Received: by bkbzs8 with SMTP id zs8so6257710bkb.13 for ; Sun, 25 Sep 2011 15:23:50 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=from:to:cc:subject:references:x-comment-to:sender:date:message-id :user-agent:mime-version:content-type; bh=nVTseTzBHjAAJZGY79dc1b1Q/VQZAQ8f7WGJNfOr+KE=; b=coNH2oZuMLNODo4zGknIPGeOtiishYX3q+Xygbg34Xsi23NRX/Z+0Lpt7uyIES+rEQ M2SZBEx2DXy3g8/tMf/GzZ7P8HAaXGaYIknd+3BfQKowklIhykYYxVEh7rRSRYlDcd3D J60iO/FCA40wMxSIbWl6m4PF+N9HFUIZSl1S4= Received: by 10.204.133.7 with SMTP id d7mr4054865bkt.104.1316987887748; Sun, 25 Sep 2011 14:58:07 -0700 (PDT) Received: from localhost ([95.69.173.122]) by mx.google.com with ESMTPS id z7sm18631390bkt.5.2011.09.25.14.58.05 (version=TLSv1/SSLv3 cipher=OTHER); Sun, 25 Sep 2011 14:58:06 -0700 (PDT) From: Mikolaj Golub To: Robert Millan References: <201108102152.p7ALqUl4075207@red.freebsd.org> <201108102200.p7AM0Nu9026320@freefall.freebsd.org> X-Comment-To: Robert Millan Sender: Mikolaj Golub Date: Mon, 26 Sep 2011 00:58:03 +0300 Message-ID: <86k48wz3mc.fsf@kopusha.home.net> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/23.2 (berkeley-unix) MIME-Version: 1.0 Content-Type: multipart/mixed; boundary="=-=-=" Cc: Josef Karthauser , freebsd-bugs@freebsd.org, Adrian Chadd , freebsd-fs@freebsd.org, FreeBSD-gnats-submit@freebsd.org Subject: Re: kern/159663: sockets don't work though nullfs mounts X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 25 Sep 2011 22:23:52 -0000 --=-=-= Hi, On Sun, 25 Sep 2011 17:32:27 +0200 Robert Millan wrote: RM> 2011/9/24 Robert Millan : >> I found a thread from 2007 with further discussion about this problem: >> >> http://lists.freebsd.org/pipermail/freebsd-fs/2007-February/002669.html RM> Hi, RM> I've looked at the situation in a bit more detail, for now only with RM> sockets in mind (not named pipes). My understanding is (please RM> correct me if I'm wrong): RM> - nullfs holds reference counts for each vnode, but sockets have their RM> own mechanism for reference counting (so_count / soref / sorele). RM> vnode reference counting doesn't protect against socket being closed, RM> which would leave a stale pointer in the upper nullfs layer. RM> - Increasing the reference count of the socket itself can't be done in RM> null_nodeget() because this function is merely a getter whose call RM> doesn't indicate any meaningful event. RM> - It's not clear to me that there's any event in time where the socket RM> reference can be increased. If mounting a nullfs were that event, RM> then all existing sockets would be soref'ed but we wouldn't be RM> soref'ing future sockets created in the lower layer after the mount. RM> This doesn't seem correct. RM> - Possible solution: null_nodeget() semantics are replaced with RM> something that actually allows vnodes in the upper layer to be created RM> and destroyed. RM> - Possible solution: upper layer has a memory structure to keep track RM> of which sockets in the lower layer have been soref'ed. It looks like there is no need in setting vp->v_un = lowervp->v_un for VFIFO. They work without this modification bypassing vnode operations to lover node and lowervp->v_un is used. The issue is only with local sockets, because when bind or connnect is called for nullfs file the upper v_un is used. For me the approach "vp->v_un = lowervp->v_un" has many complications. May be it is much easier to use always only lower vnode? What we need for this is to make bind and connect get the lower vnode when they are called on nullfs file. As a proof of concept below is a patch that implements it. Currently I am not sure that vrele/vref magic is done properly, but it looks like it works for me. The issues with this approach I see so far: - we need an additional flag for namei; - nullfs can be unmounted with a socket file still being opened. -- Mikolaj Golub --=-=-= Content-Type: text/x-patch Content-Disposition: inline; filename=nullfs.sockets.patch Index: sys/sys/namei.h =================================================================== --- sys/sys/namei.h (revision 225716) +++ sys/sys/namei.h (working copy) @@ -149,7 +149,8 @@ struct nameidata { #define AUDITVNODE1 0x04000000 /* audit the looked up vnode information */ #define AUDITVNODE2 0x08000000 /* audit the looked up vnode information */ #define TRAILINGSLASH 0x10000000 /* path ended in a slash */ -#define PARAMASK 0x1ffffe00 /* mask of parameter descriptors */ +#define LOWERVNODE 0x20000000 /* if it is a stackable fs return lower vnode */ +#define PARAMASK 0x3ffffe00 /* mask of parameter descriptors */ #define NDHASGIANT(NDP) (((NDP)->ni_cnd.cn_flags & GIANTHELD) != 0) Index: sys/kern/uipc_usrreq.c =================================================================== --- sys/kern/uipc_usrreq.c (revision 225716) +++ sys/kern/uipc_usrreq.c (working copy) @@ -493,7 +493,7 @@ uipc_bind(struct socket *so, struct sockaddr *nam, restart: vfslocked = 0; - NDINIT(&nd, CREATE, MPSAFE | NOFOLLOW | LOCKPARENT | SAVENAME, + NDINIT(&nd, CREATE, MPSAFE | NOFOLLOW | LOCKPARENT | SAVENAME | LOWERVNODE, UIO_SYSSPACE, buf, td); /* SHOULD BE ABLE TO ADOPT EXISTING AND wakeup() ALA FIFO's */ error = namei(&nd); @@ -1268,7 +1268,7 @@ unp_connect(struct socket *so, struct sockaddr *na UNP_PCB_UNLOCK(unp); sa = malloc(sizeof(struct sockaddr_un), M_SONAME, M_WAITOK); - NDINIT(&nd, LOOKUP, MPSAFE | FOLLOW | LOCKLEAF, UIO_SYSSPACE, buf, + NDINIT(&nd, LOOKUP, MPSAFE | FOLLOW | LOCKLEAF | LOWERVNODE, UIO_SYSSPACE, buf, td); error = namei(&nd); if (error) Index: sys/fs/nullfs/null_vnops.c =================================================================== --- sys/fs/nullfs/null_vnops.c (revision 225756) +++ sys/fs/nullfs/null_vnops.c (working copy) @@ -365,16 +365,40 @@ null_lookup(struct vop_lookup_args *ap) vrele(lvp); } else { error = null_nodeget(dvp->v_mount, lvp, &vp); - if (error) + if (error) { vput(lvp); - else - *ap->a_vpp = vp; + } else if ((flags & LOWERVNODE) != 0) { + vref(lvp); + vrele(vp); + *ap->a_vpp = lvp; + } else { + *ap->a_vpp = vp; + } } } return (error); } static int +null_create(struct vop_create_args *ap) +{ + struct componentname *cnp = ap->a_cnp; + int flags = cnp->cn_flags; + int retval; + struct vnode *vp, *lvp; + + retval = null_bypass(&ap->a_gen); + if (retval == 0 && (flags & LOWERVNODE) != 0) { + vp = *ap->a_vpp; + lvp = NULLVPTOLOWERVP(vp); + vref(lvp); + vrele(vp); + *ap->a_vpp = lvp; + } + return (retval); +} + +static int null_open(struct vop_open_args *ap) { int retval; @@ -826,6 +850,7 @@ struct vop_vector null_vnodeops = { .vop_accessx = null_accessx, .vop_advlockpurge = vop_stdadvlockpurge, .vop_bmap = VOP_EOPNOTSUPP, + .vop_create = null_create, .vop_getattr = null_getattr, .vop_getwritemount = null_getwritemount, .vop_inactive = null_inactive, --=-=-=--