From owner-freebsd-current@FreeBSD.ORG  Sun Dec 23 21:42:40 2012
Return-Path: <owner-freebsd-current@FreeBSD.ORG>
Delivered-To: freebsd-current@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52])
 by hub.freebsd.org (Postfix) with ESMTP id 6391F3CB
 for <freebsd-current@freebsd.org>; Sun, 23 Dec 2012 21:42:40 +0000 (UTC)
 (envelope-from freebsd@damnhippie.dyndns.org)
Received: from duck.symmetricom.us (duck.symmetricom.us [206.168.13.214])
 by mx1.freebsd.org (Postfix) with ESMTP id 240F88FC0C
 for <freebsd-current@freebsd.org>; Sun, 23 Dec 2012 21:42:39 +0000 (UTC)
Received: from damnhippie.dyndns.org (daffy.symmetricom.us [206.168.13.218])
 by duck.symmetricom.us (8.14.5/8.14.5) with ESMTP id qBNLgcp3068942
 for <freebsd-current@freebsd.org>; Sun, 23 Dec 2012 14:42:38 -0700 (MST)
 (envelope-from freebsd@damnhippie.dyndns.org)
Received: from [172.22.42.240] (revolution.hippie.lan [172.22.42.240])
 by damnhippie.dyndns.org (8.14.3/8.14.3) with ESMTP id qBNLgODG070168;
 Sun, 23 Dec 2012 14:42:24 -0700 (MST)
 (envelope-from freebsd@damnhippie.dyndns.org)
Subject: Re: Why does sleep(1) end up blocked in bwillwrite()?
From: Ian Lepore <freebsd@damnhippie.dyndns.org>
To: Konstantin Belousov <kostikbel@gmail.com>
In-Reply-To: <20121223193726.GX53644@kib.kiev.ua>
References: <1356288915.1129.68.camel@revolution.hippie.lan>
 <20121223193726.GX53644@kib.kiev.ua>
Content-Type: text/plain; charset="us-ascii"
Date: Sun, 23 Dec 2012 14:42:24 -0700
Message-ID: <1356298944.1129.75.camel@revolution.hippie.lan>
Mime-Version: 1.0
X-Mailer: Evolution 2.32.1 FreeBSD GNOME Team Port 
Content-Transfer-Encoding: 7bit
Cc: freebsd-current@freebsd.org
X-BeenThere: freebsd-current@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Discussions about the use of FreeBSD-current
 <freebsd-current.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-current>, 
 <mailto:freebsd-current-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-current>
List-Post: <mailto:freebsd-current@freebsd.org>
List-Help: <mailto:freebsd-current-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-current>,
 <mailto:freebsd-current-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 23 Dec 2012 21:42:40 -0000

On Sun, 2012-12-23 at 21:37 +0200, Konstantin Belousov wrote:
> On Sun, Dec 23, 2012 at 11:55:15AM -0700, Ian Lepore wrote:
> > Background:  I'm trying to get nandfs working on a low-end small-memory
> > embedded system.  I'm debugging performance problems that manifest as
> > the system (or large portions of it) becoming unresponsive for many
> > seconds at a time.  It appears that sometimes the nandfs background
> > garbage collector does things that lead to dirtying lots of buffers way
> > faster than they can be written.  When that happens it seems to take too
> > long (many seconds) for the problem to clear.  That's the basic
> > situation I'm investigating, but NOT what this mail is about, that's
> > just the background.
> > 
> > When this situation happens, some of the threads in my application keep
> > running fine.  Others get blocked unexpectedly even though they do no
> > disk IO at all, they're working with sockets and serial (uart) devices.
> > I discovered by accident that I can see a form of the problem happening
> > just using sleep(1) and hitting ^T while the buffer starvation is in
> > progress...
> > 
> >   guava# sleep 999999
> > [ hit ^T]
> >   load: 1.03  cmd: sleep 472 [nanslp] 2.03r 0.01u 0.02s 0% 1372k
> >   sleep: about 999997 second(s) left out of the original 999999
> > [ hit ^T]
> >   load: 1.27  cmd: sleep 472 [nanslp] 9.32r 0.01u 0.02s 0% 1376k
> >   sleep: about 999989 second(s) left out of the original 999999
> > [ hit ^T]
> >   load: 1.49  cmd: sleep 472 [nanslp] 11.53r 0.01u 0.02s 0% 1376k
> > [ note no output from sleep(1) here, repeated ^T now gives...]
> >   load: 1.49  cmd: sleep 472 [flswai] 12.01r 0.01u 0.03s 0% 1376k
> >   load: 1.49  cmd: sleep 472 [flswai] 12.27r 0.01u 0.03s 0% 1376k
> >   load: 1.49  cmd: sleep 472 [flswai] 12.76r 0.01u 0.03s 0% 1376k
> >   load: 1.49  cmd: sleep 472 [flswai] 13.06r 0.01u 0.03s 0% 1376k
> >   load: 1.49  cmd: sleep 472 [flswai] 13.26r 0.01u 0.03s 0% 1376k
> >   load: 1.61  cmd: sleep 472 [flswai] 20.03r 0.02u 0.07s 0% 1376k
> >   load: 1.64  cmd: sleep 472 [flswai] 20.49r 0.02u 0.07s 0% 1376k
> >   load: 1.64  cmd: sleep 472 [flswai] 20.68r 0.02u 0.08s 0% 1376k
> >   sleep: about 999987 second(s) left out of the original 999999
> > 
> > So here sleep(1) was blocked in bwillwrite() for about 9 seconds on a
> > write to stderr (which is an ssh xterm connection).
> > 
> > The call to bwillwrite() is in kern/sys_generic.c in dofilewrite():
> > 
> > 	if (fp->f_type == DTYPE_VNODE)
> >               bwillwrite();
> > 
> > I just noticed the checkin message that added the DTYPE_VNODE check
> > specifically mentions not penalizing devices and pipes and such.  I
> > think maybe things have evolved since then (Dec 2000) and this check is
> > no longer sufficient.  Maybe it needs to be something more like
> > 
> > 	if (fp->f_type == DTYPE_VNODE && fp->f_vnode->v_type == VREG)
> > 
> > but I have a gut feeling it needs to be more complex than that (can
> > f_vnode be NULL, what sort of locking is required to peek into f_vnode
> > at this point, etc), so I can't really propose a patch for this.  In
> > fact, I can't even say for sure it's a bug, but it sure feels like one
> > to the application-developer part of me.
> 
> The patch below would do what you want. But my opinion is that it is more
> bug in the filesystem than in the VFS. Anyway, try this and report how
> it works for you.
> 

If by "bug in the filesystem" you mean the real problem is nandfs
driving the system into buffer starvation, then yes I agree... that's
the real problem I'm pursuing.  The difficulty I had was that anything I
did to try to investigate the state of the system resulted in blocking
when it tried to output to the terminal.  

I'm running with your patch now and it seems to be working perfectly,
thanks!

-- Ian

> diff --git a/sys/fs/devfs/devfs_vnops.c b/sys/fs/devfs/devfs_vnops.c
> index 97a1bcf..9851229 100644
> --- a/sys/fs/devfs/devfs_vnops.c
> +++ b/sys/fs/devfs/devfs_vnops.c
> @@ -1049,6 +1049,7 @@ devfs_open(struct vop_open_args *ap)
>  	int error, ref, vlocked;
>  	struct cdevsw *dsw;
>  	struct file *fpop;
> +	struct mtx *mtxp;
>  
>  	if (vp->v_type == VBLK)
>  		return (ENXIO);
> @@ -1099,6 +1100,16 @@ devfs_open(struct vop_open_args *ap)
>  #endif
>  	if (fp->f_ops == &badfileops)
>  		finit(fp, fp->f_flag, DTYPE_VNODE, dev, &devfs_ops_f);
> +	mtxp = mtx_pool_find(mtxpool_sleep, fp);
> +
> +	/*
> +	 * Hint to the dofilewrite() to not force the buffer draining
> +	 * on the writer to the file.  Most likely, the write would
> +	 * not need normal buffers.
> +	 */
> +	mtx_lock(mtxp);
> +	fp->f_vnread_flags |= FDEVFS_VNODE;
> +	mtx_unlock(mtxp);
>  	return (error);
>  }
>  
> diff --git a/sys/kern/sys_generic.c b/sys/kern/sys_generic.c
> index f47cb03..b97ff7f 100644
> --- a/sys/kern/sys_generic.c
> +++ b/sys/kern/sys_generic.c
> @@ -536,7 +536,8 @@ dofilewrite(td, fd, fp, auio, offset, flags)
>  		ktruio = cloneuio(auio);
>  #endif
>  	cnt = auio->uio_resid;
> -	if (fp->f_type == DTYPE_VNODE)
> +	if (fp->f_type == DTYPE_VNODE &&
> +	    (fp->f_vnread_flags & FDEVFS_VNODE) == 0)
>  		bwillwrite();
>  	if ((error = fo_write(fp, auio, td->td_ucred, flags, td))) {
>  		if (auio->uio_resid != cnt && (error == ERESTART ||
> diff --git a/sys/sys/file.h b/sys/sys/file.h
> index dc49895..cf5f1ea 100644
> --- a/sys/sys/file.h
> +++ b/sys/sys/file.h
> @@ -178,7 +178,8 @@ struct file {
>  #define	f_advice	f_vnun.fvn_advice
>  
>  #define	FOFFSET_LOCKED       0x1
> -#define	FOFFSET_LOCK_WAITING 0x2		 
> +#define	FOFFSET_LOCK_WAITING 0x2
> +#define	FDEVFS_VNODE	     0x4
>  
>  #endif /* _KERNEL || _WANT_FILE */
>