From owner-freebsd-arch@FreeBSD.ORG  Sat Feb  2 18:34:40 2013
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@FreeBSD.org
Message-ID: <510D5C37.6000507@rice.edu>
Date: Sat, 02 Feb 2013 12:34:31 -0600
From: Alan Cox <alc@rice.edu>
User-Agent: Mozilla/5.0 (X11; FreeBSD i386;
 rv:17.0) Gecko/20130127 Thunderbird/17.0.2
MIME-Version: 1.0
To: Andriy Gapon <avg@FreeBSD.org>
Subject: Re: kva size on amd64
References: <507E7E59.8060201@FreeBSD.org> <51098743.2050603@FreeBSD.org>
 <CAJUyCcOvHXauk76LnahQPGmdcHbkDOiR1_=4w+DW=sZ6i6EJ+A@mail.gmail.com>
 <510A2C09.6030709@FreeBSD.org> <510AB848.3010806@rice.edu>
 <510B8F2B.5070609@FreeBSD.org>
In-Reply-To: <510B8F2B.5070609@FreeBSD.org>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Cc: alc@FreeBSD.org, Alan Cox <alan.l.cox@gmail.com>, freebsd-arch@FreeBSD.org

On 02/01/2013 03:47, Andriy Gapon wrote:
> on 31/01/2013 20:30 Alan Cox said the following:
>> Try developing a different allocation strategy for the kmem_map. 
>> First-fit is clearly not working well for the ZFS ARC, because of
>> fragmentation.  For example, instead of further enlarging the kmem_map,
>> try splitting it into multiple submaps of the same total size,
>> kmem_map1, kmem_map2, etc.  Then, utilize these akin to the "old" and
>> "new" spaces of a copying garbage collector or storage segments in a
>> log-structured file system.  However, actual copying from an "old" space
>> to a "new" space may not be necessary.  By the time that the "new" space
>> from which you are currently allocating fills up or becomes sufficiently
>> fragmented that you can't satisfy an allocation, you've likely created
>> enough contiguous space in an "old" space.
>>
>> I'll hypothesize that just a couple kmem_map submaps that are .625 of
>> physical memory size would suffice.  The bottom line is that the total
>> virtual address space should be less than 2x physical memory.
>>
>> In fact, maybe the system starts off with just a single kmem_map, and
>> you only create additional kmem_maps on demand.  As someone who doesn't
>> use ZFS that would actually save me physical memory that is currently
>> being wasted on unnecessary preallocated page table pages for my
>> kmem_map.  This begins to sound like option (1) that you propose above.
>>
>> This might also help to keep physical memory fragmentation in check.
> Alan,
>
> very interesting suggestions, thank you!
>
> Of course, this is quite a bit more work than just jacking up some limit :-)
> So, it could be a while before any code materializes.
>
> Actually, I have been obsessed quite for some time with an idea of confining ZFS
> to its own submap.  But ZFS does its allocations through malloc(9) and uma(9)
> (depending on configuration). It seemed like a bit of work to provide support
> for per-zone or per-tag submaps in uma and malloc.
> What is your opinion of this approach?

I'm skeptical that it would accomplish anything.  Specifically, I don't
think that it would have any impact on the fragmentation problem that we
have with ZFS.  On amd64, with its direct map, all small allocations are
implemented by uma_small_alloc() and accessed through the direct map,
rather than coming from the kmem map.  Outside of ZFS, large, multipage
allocations from the kmem map aren't that common.  So, for all practical
purposes, ZFS has the kmem map to itself.

While I'm here, I'll offer some other food for thought.  In HEAD, we
have a new-ish function, vm_page_alloc_contig(), that can allocate
physically contiguous, unmapped pages either into an arbitrary vm object
or, with VM_ALLOC_NOOBJ, without one, just like vm_page_alloc().
Moreover, just like vm_page_alloc(), it honors the
VM_ALLOC_{NORMAL,SYSTEM,INTERRUPT} thresholds and wakes the page daemon
when appropriate.

Using this function, you could rewrite the multipage allocation code to
first attempt allocation through vm_page_alloc_contig() and then fall
back to the kmem map only if vm_page_alloc_contig() fails.
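
To make the control flow concrete, here is a rough userspace sketch of
that "contiguous first, kmem map second" ordering.  It is not kernel
code: sim_alloc_contig() and sim_alloc_kmem_map() are hypothetical
stand-ins for vm_page_alloc_contig() and the existing kmem map path,
both backed by malloc(), and the free-page counter is an invented knob
that lets the first path fail.  Only the fallback order reflects the
idea above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define SKETCH_PAGE_SIZE 4096u

/* Which path satisfied a multipage allocation. */
enum mp_source { SRC_NONE, SRC_CONTIG, SRC_KMEM_MAP };

struct mp_result {
    void *addr;
    size_t npages;
    enum mp_source source;
};

/*
 * Hypothetical knob: how many physically contiguous pages the simulated
 * "physical memory" can still hand out in one run.
 */
static size_t sim_contig_pages_free = 8;

/* Stand-in for vm_page_alloc_contig(); fails when the run is too long. */
static void *sim_alloc_contig(size_t npages)
{
    if (npages > sim_contig_pages_free)
        return NULL;
    sim_contig_pages_free -= npages;
    return malloc(npages * SKETCH_PAGE_SIZE);
}

/* Stand-in for the existing allocation path through the kmem map. */
static void *sim_alloc_kmem_map(size_t npages)
{
    return malloc(npages * SKETCH_PAGE_SIZE);
}

/* Try the contiguous path first; use the kmem map only if that fails. */
static struct mp_result mp_alloc(size_t npages)
{
    struct mp_result r = { NULL, npages, SRC_NONE };

    assert(npages > 0);
    r.addr = sim_alloc_contig(npages);
    if (r.addr != NULL) {
        r.source = SRC_CONTIG;
        return r;
    }
    r.addr = sim_alloc_kmem_map(npages);
    if (r.addr != NULL)
        r.source = SRC_KMEM_MAP;
    return r;
}

/* Release an allocation and return contiguous pages to the simulated pool. */
static void mp_free(struct mp_result *r)
{
    if (r->source == SRC_CONTIG)
        sim_contig_pages_free += r->npages;
    free(r->addr);
    r->addr = NULL;
    r->source = SRC_NONE;
}
```

With the simulated pool of 8 pages, a first mp_alloc(6) is served by the
contiguous path, a second mp_alloc(6) falls back to the kmem map, and
after mp_free() of the first, another mp_alloc(6) is contiguous again.
The caller records which path was taken so that the free side can undo
the right one.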

> P.S.
> BTW, do I understand correctly that the reservation of kernel page tables
> happens through vm_map_insert -> pmap_growkernel ?
>

I believe kib@ already answered this, but, yes, that is correct.