From owner-freebsd-current@FreeBSD.ORG Thu Dec 15 23:41:37 2011
Return-Path:
Delivered-To: current@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id F22361065676
	for ; Thu, 15 Dec 2011 23:41:36 +0000 (UTC)
	(envelope-from truckman@FreeBSD.org)
Received: from gw.catspoiler.org (gw.catspoiler.org [75.1.14.242])
	by mx1.freebsd.org (Postfix) with ESMTP id BE3378FC1F
	for ; Thu, 15 Dec 2011 23:41:36 +0000 (UTC)
Received: from FreeBSD.org (mousie.catspoiler.org [192.168.101.2])
	by gw.catspoiler.org (8.13.3/8.13.3) with ESMTP id pBFNUqDe063464;
	Thu, 15 Dec 2011 15:30:56 -0800 (PST)
	(envelope-from truckman@FreeBSD.org)
Message-Id: <201112152330.pBFNUqDe063464@gw.catspoiler.org>
Date: Thu, 15 Dec 2011 15:30:52 -0800 (PST)
From: Don Lewis
To: phk@phk.freebsd.dk
In-Reply-To: <1732.1323872049@critter.freebsd.dk>
MIME-Version: 1.0
Content-Type: TEXT/plain; charset=us-ascii
Cc: seanbru@yahoo-inc.com, current@FreeBSD.org
Subject: Re: dogfooding over in clusteradm land
X-BeenThere: freebsd-current@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussions about the use of FreeBSD-current
X-List-Received-Date: Thu, 15 Dec 2011 23:41:37 -0000

On 14 Dec, Poul-Henning Kamp wrote:
> In message <1323868832.5283.9.camel@hitfishpass-lx.corp.yahoo.com>, Sean Bruno
> writes:
>
>> We're seeing what looks like a syncer/ufs resource starvation on 9.0 on
>> the cvs2svn ports conversion box.  I'm not sure what resource is tapped
>> out.
>
> Search the mail archive for "lemming-syncer".

That should only produce a slowdown every 30 seconds, not a deadlock.  I'd
be more suspicious of a memory allocation deadlock.  One can occur when the
system runs short of free memory because a large number of dirty buffers
have accumulated, but the kernel needs to allocate memory in order to flush
those buffers to disk.
This could be more likely to happen if you are using a software RAID layer,
but I suspect that the recent change to the default UFS block size from 16K
to 32K is the culprit.  In another thread bde pointed out that the BKVASIZE
definition in sys/param.h hadn't been updated to match the new default UFS
block size:

/*
 * BKVASIZE - Nominal buffer space per buffer, in bytes.  BKVASIZE is the
 *	minimum KVM memory reservation the kernel is willing to make.
 *	Filesystems can of course request smaller chunks.  Actual
 *	backing memory uses a chunk size of a page (PAGE_SIZE).
 *
 *	If you make BKVASIZE too small you risk seriously fragmenting
 *	the buffer KVM map which may slow things down a bit.  If you
 *	make it too big the kernel will not be able to optimally use
 *	the KVM memory reserved for the buffer cache and will wind
 *	up with too-few buffers.
 *
 *	The default is 16384, roughly 2x the block size used by a
 *	normal UFS filesystem.
 */
#define MAXBSIZE	65536	/* must be power of 2 */
#define BKVASIZE	16384	/* must be power of 2 */

The problem is that BKVASIZE is used in a number of the tuning calculations
in vfs_bio.c:

	/*
	 * The nominal buffer size (and minimum KVA allocation) is BKVASIZE.
	 * For the first 64MB of ram nominally allocate sufficient buffers to
	 * cover 1/4 of our ram.  Beyond the first 64MB allocate additional
	 * buffers to cover 1/10 of our ram over 64MB.  When auto-sizing
	 * the buffer cache we limit the eventual kva reservation to
	 * maxbcache bytes.
	 *
	 * factor represents the 1/4 x ram conversion.
	 */
	if (nbuf == 0) {
		int factor = 4 * BKVASIZE / 1024;

		nbuf = 50;
		if (physmem_est > 4096)
			nbuf += min((physmem_est - 4096) / factor,
			    65536 / factor);
		if (physmem_est > 65536)
			nbuf += (physmem_est - 65536) * 2 / (factor * 5);

		if (maxbcache && nbuf > maxbcache / BKVASIZE)
			nbuf = maxbcache / BKVASIZE;
		tuned_nbuf = 1;
	} else
		tuned_nbuf = 0;

	/* XXX Avoid unsigned long overflows later on with maxbufspace.
	 */
	maxbuf = (LONG_MAX / 3) / BKVASIZE;

	/*
	 * maxbufspace is the absolute maximum amount of buffer space we are
	 * allowed to reserve in KVM and in real terms.  The absolute maximum
	 * is nominally used by buf_daemon.  hibufspace is the nominal maximum
	 * used by most other processes.  The differential is required to
	 * ensure that buf_daemon is able to run when other processes might
	 * be blocked waiting for buffer space.
	 *
	 * maxbufspace is based on BKVASIZE.  Allocating buffers larger then
	 * this may result in KVM fragmentation which is not handled optimally
	 * by the system.
	 */
	maxbufspace = (long)nbuf * BKVASIZE;
	hibufspace = lmax(3 * maxbufspace / 4, maxbufspace - MAXBSIZE * 10);
	lobufspace = hibufspace - MAXBSIZE;

If you are using the new 32K default filesystem block size, then you may be
consuming twice as much memory for buffers as the tuning calculations think
you are.  Increasing maxvnodes is probably the wrong way to go, since it
will increase memory pressure.  As a quick and dirty test, try cutting
kern.nbuf in half.  The correct fix is probably to rebuild the kernel with
BKVASIZE doubled.
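To make the mismatch concrete, here is a small standalone C sketch (not the
kernel code itself; the 4 GB machine size is illustrative, min() is expanded
inline, and the maxbcache clamp is omitted) that redoes the nbuf auto-tuning
arithmetic for a BKVASIZE of 16K versus one doubled to match the 32K block
size:

```c
#include <stdio.h>

/*
 * Redo the vfs_bio.c nbuf auto-tuning arithmetic for a given BKVASIZE.
 * physmem_est is in kilobytes, as in the kernel code.
 */
static long
tune_nbuf(long physmem_est, long bkvasize)
{
	long factor = 4 * bkvasize / 1024;
	long nbuf = 50;

	if (physmem_est > 4096)
		nbuf += (physmem_est - 4096) / factor < 65536 / factor ?
		    (physmem_est - 4096) / factor : 65536 / factor;
	if (physmem_est > 65536)
		nbuf += (physmem_est - 65536) * 2 / (factor * 5);
	return (nbuf);
}

int
main(void)
{
	long physmem_est = 4L * 1024 * 1024;	/* hypothetical 4 GB, in KB */
	long nbuf16 = tune_nbuf(physmem_est, 16384);
	long nbuf32 = tune_nbuf(physmem_est, 32768);

	printf("BKVASIZE 16K: nbuf = %ld, reserves %ld bytes of KVA\n",
	    nbuf16, nbuf16 * 16384);
	/* With 32K filesystem buffers, the same nbuf maps twice as much. */
	printf("  ... but %ld bytes if the buffers are actually 32K\n",
	    nbuf16 * 32768);
	printf("BKVASIZE 32K: nbuf = %ld, reserves %ld bytes of KVA\n",
	    nbuf32, nbuf32 * 32768);
	return (0);
}
```

Doubling BKVASIZE roughly halves nbuf, so the total KVA reservation stays
about the same, but the accounting then agrees with what 32K buffers really
consume.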
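For what it's worth, the quick kern.nbuf test shouldn't need a kernel
rebuild, since nbuf can be overridden as a loader tunable.  A sketch of the
procedure (the value shown is purely illustrative; read the auto-tuned value
on the affected box and halve that):

```shell
# Read the current auto-tuned value:
sysctl kern.nbuf

# Then set roughly half of it in /boot/loader.conf and reboot
# (illustrative value; substitute half of what sysctl reported):
kern.nbuf="13000"
```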