From: Andre Oppermann <andre@freebsd.org>
Date: Thu, 07 Mar 2013 18:03:51 +0100
To: Alan Cox
Cc: alc@freebsd.org, freebsd-current@freebsd.org, kib@freebsd.org
Subject: Re: Cleanup and untangling of kernel VM initialization
Message-ID: <5138C877.9060808@freebsd.org>
In-Reply-To: <510BF6E0.8070007@rice.edu>

On 01.02.2013 18:09, Alan Cox wrote:
> On 02/01/2013 07:25, Andre Oppermann wrote:
>> Rebase auto-sizing of limits on the available KVM/kmem_map instead
>> of physical memory. Depending on the kernel and architecture
>> configuration these two can be very different.
>>
>> Comments and reviews appreciated.
>
> I would really like to see the issues with the current auto-sizing
> code addressed before any of the stylistic changes or en-masse
> conversions to SYSINIT()s are considered. In particular, can we
> please start with the patch that moves the pipe_map initialization?
> After that, I think that we should revisit tunable_mbinit() and
> "maxmbufmem".

OK. I'm trying to describe and explain the big picture for myself and
other interested observers. The following text and explanations are
going to be verbose and sometimes redundant. If something is
incorrect or incomplete please yell; I'm not an expert in all these
parts and may easily have missed some subtle aspects.

The kernel_map serves as the container for the entire available
kernel VM address space, including the kernel text, data, and bss, as
well as other bootstrapped and pre-VM allocated structures. The
kernel_map should cover a reasonably large amount of address space to
be able to serve the memory allocation demands of the various kernel
subsystems. The CPU architecture's address range (32 or 64 bits) puts
a hard ceiling on the total size of the kernel_map. Depending on the
architecture, the kernel_map covers a specific range within the total
addressable address space.
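To illustrate that hard ceiling, here is a trivial compilable sketch;
the two constants are illustrative stand-ins for an architecture's
real VM_MIN_KERNEL_ADDRESS/VM_MAX_KERNEL_ADDRESS values, not any
platform's actual layout:

  #include <stdio.h>
  #include <stdint.h>

  /* Illustrative stand-ins for the MD constants; values per-arch. */
  #define VM_MIN_KERNEL_ADDRESS 0xffffffff80000000UL /* example only */
  #define VM_MAX_KERNEL_ADDRESS 0xfffffffff0000000UL /* example only */

  int
  main(void)
  {
          /* The addressable range puts a hard ceiling on KVM. */
          uint64_t kvm_ceiling =
              VM_MAX_KERNEL_ADDRESS - VM_MIN_KERNEL_ADDRESS;

          printf("hard KVM ceiling: %ju MB\n",
              (uintmax_t)(kvm_ceiling >> 20));
          return (0);
  }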
Schematically the kernel address space looks like this:

* VM_MIN_KERNEL_ADDRESS
* [KERNBASE]
* kernel_map [actually mapped KVM range, direct allocations]
* kernel text, data, bss
* bootstrap and statically allocated structures [pmap]
* virtual_avail [start of usable KVM]
* kmem_map [submap for (most) UMA zones and kernel malloc]
* exec_map [submap for temporary mappings during process exec()]
* pipe_map [submap for temporary buffering of data between piped
  processes]
* clean_map [submap for buffer_map and pager_map]
* buffer_map [submap for BIO buffers]
* pager_map [submap for temporary pager IO holding]
* memguard_map [submap for debugging of UMA and kernel malloc]
* ... [kernel_map direct allocations, free and unused space]
* kernel_map [end of kernel_map]
* ...
* virtual_end [end of possible KVM]
* VM_MAX_KERNEL_ADDRESS

Some of the kernel_map's submaps are special in being non-pageable
and in pre-allocating the necessary pmap structures to avoid page
faults. The pre-allocation consumes physical memory. Thus a submap's
pre-allocation should not be larger than a reasonably small fraction
of available physical memory, to leave enough room for other kernel
and userspace memory demands.

The pseudo-code for a dynamic calculation of a submap size would look
like this (a C sketch of this follows further below):

  submap.size = min(physmem.size / pmap.prealloc_max_fraction
                        / pmap.size_per_page * page_size,
                    kernel_map.free_size)

The pmap.prealloc_max_fraction is the largest fraction of physical
memory that we allow the pre-allocated pmap structures of a single
submap to occupy.

Separate submaps are usually used to segregate certain types of
memory usage and to have individual limits applied to them:

kmem_map: tries to be as large as possible. It serves the bulk of all
dynamically allocated kernel memory and is the memory pool used by
UMA and kernel malloc. Almost all kernel structures come from here:
process, thread, and file descriptors, mbufs and mbuf clusters,
network connection control blocks, sockets, etc. It is not pageable.
Calculation: currently only partially dynamic; the MD parts can
specify particular min and max limits and scaling factors. It can
likely be generalized, with only very special platforms requiring
additional limits.

exec_map: is used as temporary storage while setting up a process's
address space and related items. It is very small and by default
contains only 16 pages.
Calculation: exec_map_entries * round_page(PATH_MAX + ARG_MAX).

pipe_map: is used to move piped data between processes. It is
pageable memory.
Calculation: min(physmem.size, kernel_map.size) / 64.

clean_map: overarching submap containing the buffer_map and
pager_map. Likely no longer necessary; a leftover from earlier
incarnations of the kernel VM.

buffer_map: is used for BIO structures to perform IO between the
kernel VM and storage media (disk). Not pageable.
Calculation: min(physmem.size, kernel_map.size) / 4 up to 64MB, and
1/10 thereafter.

pager_map: is used for pager IO to storage media (disk). Not
pageable.
Calculation: MAXPHYS * min(max(nbuf / 4, 16), 256).

memguard_map: is a special debugging submap substituting parts of
kmem_map. Normally not used.

There is some competition between these maps for physical memory. One
has to be careful to strike an overall balance among them with
respect to static and dynamic physical memory use.

Within the submaps, especially the kmem_map, we have a number of
dynamic UMA suballocators on whose total memory usage we have to put
a ceiling, to prevent them from consuming all physical *and/or*
kmem_map virtual memory. This is done with UMA zone limits.
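To make the sizing rules concrete, here is a minimal,
userland-compilable C sketch of the generic submap formula and of the
buffer_map scaling described above. All function and parameter names
here are hypothetical stand-ins for illustration, not existing kernel
symbols:

  #include <stdint.h>

  #define PAGE_SIZE 4096UL /* illustrative; MD in the real kernel */

  /*
   * Generic submap sizing: never let a submap's pre-allocated pmap
   * structures consume more than 1/prealloc_max_fraction of physical
   * memory, and never exceed what is left in the kernel_map.
   */
  static uint64_t
  submap_size(uint64_t physmem_bytes, uint64_t kernel_map_free,
      uint64_t prealloc_max_fraction, uint64_t pmap_bytes_per_page)
  {
          uint64_t pages, bytes;

          pages = physmem_bytes / prealloc_max_fraction /
              pmap_bytes_per_page;
          bytes = pages * PAGE_SIZE;
          return (bytes < kernel_map_free ? bytes : kernel_map_free);
  }

  /*
   * buffer_map scaling as described above: 1/4 of
   * min(physmem, KVM) up to the 64MB knee, 1/10 of the remainder
   * thereafter.
   */
  static uint64_t
  buffer_map_size(uint64_t physmem_bytes, uint64_t kvm_bytes)
  {
          uint64_t mem, knee;

          mem = physmem_bytes < kvm_bytes ? physmem_bytes : kvm_bytes;
          knee = 64UL << 20;              /* 64MB */

          if (mem / 4 <= knee)            /* i.e. mem <= 256MB */
                  return (mem / 4);
          return (knee + (mem - 4 * knee) / 10);
  }

The min() against the free kernel_map space is what keeps a
large-memory machine with a small KVM (e.g. a 32-bit kernel with lots
of RAM) from overcommitting the address space, which is exactly the
KVM-vs-physmem distinction this thread is about.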
No single externally exploitable UMA zone should be able to consume
all available physical memory. This applies, for example, to the
number of processes, file descriptors, sockets, mbufs, and mbuf
clusters. These need to be limited to a fraction of available
physical memory that is reasonable yet still permits heavy
work-loads. However, there is going to be overcommit among them, and
not all of them can be at their limit at the same time. Probably none
of these UMA zones should be allowed to occupy more than 1/2 of all
available physical memory.

Often individual UMA zone limits have to be put into context and
related to other concurrent UMA zones. This usually means a reduced
UMA zone limit for a particular zone. Balancing this takes a slight
amount of voodoo magic and knowledge of common extreme work-loads to
get right. On the other hand, for most of those zones allocations are
permitted to fail, merely rendering, say, an attempt at connection
establishment unsuccessful; it can be retried later.

Generic pseudo-code:

  UMA zone limit = min(kmem_map.size, physmem.size) / 4
                   (or another appropriate fraction)

It could be that some of the kernel_map submaps are no longer
necessary and their purpose could simply be served by an
appropriately limited UMA zone. For example, the exec_map is very
small and only used for the exec arguments; putting these into
pageable memory isn't very useful anymore. Also, the interesting
construct of the clean_map containing only the buffer_map and
pager_map doesn't seem necessary anymore and is probably a remnant of
an earlier incarnation of the VM.
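To illustrate the generic formula, here is a minimal kernel-style
sketch of capping a zone via uma_zone_set_max(). Only uma_zcreate()
and uma_zone_set_max() are the real UMA KPI; the zone, the helper
function, and its parameters are hypothetical:

  /* Kernel-style sketch; not a drop-in patch. */
  #include <sys/param.h>
  #include <vm/uma.h>

  static uma_zone_t foo_zone;     /* hypothetical example zone */

  static void
  foo_zone_limit_init(uint64_t physmem_bytes, uint64_t kmem_map_bytes,
      size_t item_size)
  {
          uint64_t limit_bytes;

          foo_zone = uma_zcreate("foo", item_size, NULL, NULL, NULL,
              NULL, UMA_ALIGN_PTR, 0);

          /*
           * Cap the zone at 1/4 of min(kmem_map, physmem) so that no
           * single externally exploitable zone can exhaust either
           * physical memory or kmem_map virtual address space.
           */
          limit_bytes = (kmem_map_bytes < physmem_bytes ?
              kmem_map_bytes : physmem_bytes) / 4;
          uma_zone_set_max(foo_zone, (int)(limit_bytes / item_size));
  }

Comments, discussion and additional input welcome.

--
Andre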