From: Ben Kelly <ben@wanderview.com>
Date: Tue, 28 Sep 2010 18:01:21 -0400
To: Andriy Gapon
Cc: stable@freebsd.org, Willem Jan Withagen, fs@freebsd.org, Jeremy Chadwick
Subject: Re: Still getting kmem exhausted panic

On Sep 28, 2010, at 5:30 PM, Andriy Gapon wrote:

<< snipped lots of good info here... probably won't have time to look at it in detail until the weekend >>

>> there seems to be a layering violation in that the buffer cache signals
>> directly to the upper page daemon layer to trigger page reclamation.)
>
> Umm, not sure if that is a fact.

I was referring to the code in vfs_bio.c that used to twiddle vm_pageout_deficit directly. That seems to have been replaced with a call to vm_page_grab().

>> The old (ancient) patch I tried previously to help reduce the arc working set
>> and allow it to shrink is here:
>>
>> http://www.wanderview.com/svn/public/misc/zfs/zfs_kmem_limit.diff
>>
>> Unfortunately, there are a couple ideas on fighting fragmentation mixed into
>> that patch. See the part about arc_reclaim_pages(). This patch did seem to
>> allow my arc to stay under the target maximum even when under load that
>> previously caused the system to exceed the maximum. When I update this
>> weekend I'll try a stripped down version of the patch to see if it helps or
>> not with the latest zfs.
>>
>> Thanks for your help in understanding this stuff!
>
> The patch seems good, especially the part about taking into account the kmem
> fragmentation. But it also seems to be heavily tuned towards "tiny ARC" systems
> like yours, so I am not sure yet how suitable it is for "mainstream" systems.

Thanks. Yea, there is a lot of aggressive tuning there. In particular, the slow growth algorithm is somewhat dubious. What I found, though, was that the fragmentation jumped whenever the arc was reduced in size, so the slow growth was an attempt to make the arc size approach peak load gradually without overshooting (a rough sketch of the idea is below).

A better long-term solution would probably be to enhance UMA to support custom slab sizes on a zone-by-zone basis. That way all zfs/arc allocations could use 128k slabs (at a memory-efficiency penalty, of course). I prototyped this with a dumbed down block pool allocator at one point and was able to avoid most, if not all, of the fragmentation. Adding the support to UMA seemed non-trivial, though.
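To make the slow growth idea above a bit more concrete, here is a toy userland sketch. The names, the step size, and all the numbers are made up for illustration; this is not the code from the patch, just the shape of the approach:

/*
 * Toy sketch of "slow growth": instead of letting the arc target jump
 * straight to the demand peak (and then fragment kmem when it is forced
 * back down), nudge the target toward demand in small capped steps.
 * All names and constants here are illustrative only.
 */
#include <stdio.h>
#include <stdint.h>

#define GROWTH_STEP	(4ULL * 1024 * 1024)	/* grow at most 4 MB per pass */

static uint64_t arc_target = 64ULL * 1024 * 1024;	/* current target */

/* Called periodically; moves the target toward demand without overshoot. */
static void
arc_slow_grow(uint64_t demand, uint64_t arc_max)
{
	if (demand > arc_target) {
		uint64_t step = demand - arc_target;
		if (step > GROWTH_STEP)
			step = GROWTH_STEP;	/* cap growth per pass */
		arc_target += step;
	}
	if (arc_target > arc_max)
		arc_target = arc_max;		/* never exceed the hard cap */
}

int
main(void)
{
	/* Simulate a demand spike to 200 MB against a 128 MB cap. */
	for (int pass = 0; pass < 20; pass++) {
		arc_slow_grow(200ULL * 1024 * 1024, 128ULL * 1024 * 1024);
		printf("pass %2d: target = %3ju MB\n", pass,
		    (uintmax_t)(arc_target >> 20));
	}
	return (0);
}

The target climbs a few MB per pass and stops exactly at the cap, so a transient spike never drags the arc above a size it would later have to shrink back from.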
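The block pool idea, stripped to its core, looks something like the following. Again this is a toy userland sketch, not the actual prototype; malloc() stands in for grabbing a kmem slab, and the sizes are just examples:

/*
 * Toy sketch of a dumbed down block pool: carve fixed-size blocks out of
 * uniform 128k slabs and recycle them through a free list, so allocations
 * and frees always stay within large, uniformly sized chunks of address
 * space instead of fragmenting kmem.  Slabs are never returned here.
 */
#include <stdlib.h>
#include <stdio.h>

#define SLAB_SIZE	(128 * 1024)

struct block {			/* free blocks are chained through */
	struct block *next;	/* their own first word */
};

struct pool {
	size_t	      block_size;	/* fixed size served by this pool */
	struct block *free_list;	/* blocks available for reuse */
};

/* Carve a fresh 128k slab into blocks and push them on the free list. */
static int
pool_refill(struct pool *p)
{
	char *slab = malloc(SLAB_SIZE);	/* stand-in for a kmem slab */

	if (slab == NULL)
		return (-1);
	for (size_t off = 0; off + p->block_size <= SLAB_SIZE;
	    off += p->block_size) {
		struct block *b = (struct block *)(slab + off);

		b->next = p->free_list;
		p->free_list = b;
	}
	return (0);
}

static void *
pool_alloc(struct pool *p)
{
	struct block *b;

	if (p->free_list == NULL && pool_refill(p) != 0)
		return (NULL);
	b = p->free_list;
	p->free_list = b->next;
	return (b);
}

static void
pool_free(struct pool *p, void *v)
{
	struct block *b = v;

	b->next = p->free_list;	/* recycle instead of returning to kmem */
	p->free_list = b;
}

int
main(void)
{
	struct pool p = { .block_size = 16 * 1024, .free_list = NULL };
	void *a = pool_alloc(&p);
	void *b = pool_alloc(&p);

	pool_free(&p, a);
	void *c = pool_alloc(&p);	/* should reuse a's block */
	printf("a=%p b=%p c=%p (c reuses a: %s)\n", a, b, c,
	    c == a ? "yes" : "no");
	pool_free(&p, b);
	pool_free(&p, c);
	return (0);
}

The point being that since every slab is the same 128k and every block inside it the same size, a freed block is always exactly reusable by the next allocation, which is why most of the fragmentation went away.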
Thanks again for the information. I hope to get a chance to look at the code this weekend.

- Ben