From: Andre Oppermann <andre@freebsd.org>
Date: Thu, 07 Mar 2013 18:03:51 +0100
To: Alan Cox
Cc: alc@freebsd.org, freebsd-current@freebsd.org, kib@freebsd.org
Subject: Re: Cleanup and untangling of kernel VM initialization
Message-ID: <5138C877.9060808@freebsd.org>
In-Reply-To: <510BF6E0.8070007@rice.edu>

On 01.02.2013 18:09, Alan Cox wrote:
> On 02/01/2013 07:25, Andre Oppermann wrote:
>> Rebase auto-sizing of limits on the available KVM/kmem_map instead
>> of physical memory. Depending on the kernel and architecture
>> configuration these two can be very different.
>>
>> Comments and reviews appreciated.
>
> I would really like to see the issues with the current auto-sizing
> code addressed before any of the stylistic changes or en-masse
> conversions to SYSINIT()s are considered. In particular, can we
> please start with the patch that moves the pipe_map initialization?
> After that, I think that we should revisit tunable_mbinit() and
> "maxmbufmem".

OK. I'm trying to describe and explain the big picture for myself and
other interested observers. The following text and explanations are
going to be verbose and sometimes redundant. If something is
incorrect or incomplete please yell; I'm not an expert in all these
parts and may easily have missed some subtle aspects.

The kernel_map serves as the container for the entire available
kernel VM address space, including the kernel text, data, and bss, as
well as other bootstrapped and pre-VM allocated structures. The
kernel_map should cover a reasonably large amount of address space to
be able to serve the memory allocation demands of the various kernel
subsystems. The CPU architecture's address range (32 or 64 bits) puts
a hard ceiling on the total size of the kernel_map. Depending on the
architecture, the kernel_map covers a specific range within the total
addressable address space.
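To illustrate that hard ceiling, here is a trivial compilable sketch;
the two constants are illustrative stand-ins for an architecture's
real VM_MIN_KERNEL_ADDRESS/VM_MAX_KERNEL_ADDRESS values, not any
platform's actual layout:

  #include <stdio.h>
  #include <stdint.h>

  /* Illustrative stand-ins for the MD constants; values per-arch. */
  #define VM_MIN_KERNEL_ADDRESS 0xffffffff80000000UL /* example only */
  #define VM_MAX_KERNEL_ADDRESS 0xfffffffff0000000UL /* example only */

  int
  main(void)
  {
          /* The addressable range puts a hard ceiling on KVM. */
          uint64_t kvm_ceiling =
              VM_MAX_KERNEL_ADDRESS - VM_MIN_KERNEL_ADDRESS;

          printf("hard KVM ceiling: %ju MB\n",
              (uintmax_t)(kvm_ceiling >> 20));
          return (0);
  }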
Schematically the kernel address space looks like this:

* VM_MIN_KERNEL_ADDRESS
* [KERNBASE]
* kernel_map [actually mapped KVM range, direct allocations]
* kernel text, data, bss
* bootstrap and statically allocated structures [pmap]
* virtual_avail [start of usable KVM]
* kmem_map [submap for (most) UMA zones and kernel malloc]
* exec_map [submap for temporary mappings during process exec()]
* pipe_map [submap for temporary buffering of data between piped
  processes]
* clean_map [submap for buffer_map and pager_map]
* buffer_map [submap for BIO buffers]
* pager_map [submap for temporary pager IO holding]
* memguard_map [submap for debugging of UMA and kernel malloc]
* ... [kernel_map direct allocations, free and unused space]
* kernel_map [end of kernel_map]
* ...
* virtual_end [end of possible KVM]
* VM_MAX_KERNEL_ADDRESS

Some of the kernel_map's submaps are special in being non-pageable
and in pre-allocating the necessary pmap structures to avoid page
faults. The pre-allocation consumes physical memory. Thus a submap's
pre-allocation should not be larger than a reasonably small fraction
of available physical memory, to leave enough room for other kernel
and userspace memory demands.

The pseudo-code for a dynamic calculation of a submap size would look
like this (a C sketch of this follows further below):

  submap.size = min(physmem.size / pmap.prealloc_max_fraction
                        / pmap.size_per_page * page_size,
                    kernel_map.free_size)

The pmap.prealloc_max_fraction is the largest fraction of physical
memory that we allow the pre-allocated pmap structures of a single
submap to occupy.

Separate submaps are usually used to segregate certain types of
memory usage and to have individual limits applied to them:

kmem_map: tries to be as large as possible. It serves the bulk of all
dynamically allocated kernel memory and is the memory pool used by
UMA and kernel malloc. Almost all kernel structures come from here:
process, thread, and file descriptors, mbufs and mbuf clusters,
network connection control blocks, sockets, etc. It is not pageable.
Calculation: currently only partially dynamic; the MD parts can
specify particular min and max limits and scaling factors. It can
likely be generalized, with only very special platforms requiring
additional limits.

exec_map: is used as temporary storage while setting up a process's
address space and related items. It is very small and by default
contains only 16 pages.
Calculation: exec_map_entries * round_page(PATH_MAX + ARG_MAX).

pipe_map: is used to move piped data between processes. It is
pageable memory.
Calculation: min(physmem.size, kernel_map.size) / 64.

clean_map: overarching submap containing the buffer_map and
pager_map. Likely no longer necessary; a leftover from earlier
incarnations of the kernel VM.

buffer_map: is used for BIO structures to perform IO between the
kernel VM and storage media (disk). Not pageable.
Calculation: min(physmem.size, kernel_map.size) / 4 up to 64MB, and
1/10 thereafter.

pager_map: is used for pager IO to storage media (disk). Not
pageable.
Calculation: MAXPHYS * min(max(nbuf / 4, 16), 256).

memguard_map: is a special debugging submap substituting parts of
kmem_map. Normally not used.

There is some competition between these maps for physical memory. One
has to be careful to strike an overall balance among them with
respect to static and dynamic physical memory use.

Within the submaps, especially the kmem_map, we have a number of
dynamic UMA suballocators on whose total memory usage we have to put
a ceiling, to prevent them from consuming all physical *and/or*
kmem_map virtual memory. This is done with UMA zone limits.
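To make the sizing rules concrete, here is a minimal,
userland-compilable C sketch of the generic submap formula and of the
buffer_map scaling described above. All function and parameter names
here are hypothetical stand-ins for illustration, not existing kernel
symbols:

  #include <stdint.h>

  #define PAGE_SIZE 4096UL /* illustrative; MD in the real kernel */

  /*
   * Generic submap sizing: never let a submap's pre-allocated pmap
   * structures consume more than 1/prealloc_max_fraction of physical
   * memory, and never exceed what is left in the kernel_map.
   */
  static uint64_t
  submap_size(uint64_t physmem_bytes, uint64_t kernel_map_free,
      uint64_t prealloc_max_fraction, uint64_t pmap_bytes_per_page)
  {
          uint64_t pages, bytes;

          pages = physmem_bytes / prealloc_max_fraction /
              pmap_bytes_per_page;
          bytes = pages * PAGE_SIZE;
          return (bytes < kernel_map_free ? bytes : kernel_map_free);
  }

  /*
   * buffer_map scaling as described above: 1/4 of
   * min(physmem, KVM) up to the 64MB knee, 1/10 of the remainder
   * thereafter.
   */
  static uint64_t
  buffer_map_size(uint64_t physmem_bytes, uint64_t kvm_bytes)
  {
          uint64_t mem, knee;

          mem = physmem_bytes < kvm_bytes ? physmem_bytes : kvm_bytes;
          knee = 64UL << 20;              /* 64MB */

          if (mem / 4 <= knee)            /* i.e. mem <= 256MB */
                  return (mem / 4);
          return (knee + (mem - 4 * knee) / 10);
  }

The min() against the free kernel_map space is what keeps a
large-memory machine with a small KVM (e.g. a 32-bit kernel with lots
of RAM) from overcommitting the address space, which is exactly the
KVM-vs-physmem distinction this thread is about.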
No single externally exploitable UMA zone should be able to consume
all available physical memory. This applies, for example, to the
number of processes, file descriptors, sockets, mbufs, and mbuf
clusters. These need to be limited to a fraction of available
physical memory that is reasonable yet still permits heavy
work-loads. However, there is going to be overcommit among them, and
not all of them can be at their limit at the same time. Probably none
of these UMA zones should be allowed to occupy more than 1/2 of all
available physical memory.

Often individual UMA zone limits have to be put into context and
related to other concurrent UMA zones. This usually means a reduced
UMA zone limit for a particular zone. Balancing this takes a slight
amount of voodoo magic and knowledge of common extreme work-loads to
get right. On the other hand, for most of those zones allocations are
permitted to fail, merely rendering, say, an attempt at connection
establishment unsuccessful; it can be retried later.

Generic pseudo-code:

  UMA zone limit = min(kmem_map.size, physmem.size) / 4
                   (or another appropriate fraction)

It could be that some of the kernel_map submaps are no longer
necessary and their purpose could simply be served by an
appropriately limited UMA zone. For example, the exec_map is very
small and only used for the exec arguments; putting these into
pageable memory isn't very useful anymore. Also, the interesting
construct of the clean_map containing only the buffer_map and
pager_map doesn't seem necessary anymore and is probably a remnant of
an earlier incarnation of the VM.
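To illustrate the generic formula, here is a minimal kernel-style
sketch of capping a zone via uma_zone_set_max(). Only uma_zcreate()
and uma_zone_set_max() are the real UMA KPI; the zone, the helper
function, and its parameters are hypothetical:

  /* Kernel-style sketch; not a drop-in patch. */
  #include <sys/param.h>
  #include <vm/uma.h>

  static uma_zone_t foo_zone;     /* hypothetical example zone */

  static void
  foo_zone_limit_init(uint64_t physmem_bytes, uint64_t kmem_map_bytes,
      size_t item_size)
  {
          uint64_t limit_bytes;

          foo_zone = uma_zcreate("foo", item_size, NULL, NULL, NULL,
              NULL, UMA_ALIGN_PTR, 0);

          /*
           * Cap the zone at 1/4 of min(kmem_map, physmem) so that no
           * single externally exploitable zone can exhaust either
           * physical memory or kmem_map virtual address space.
           */
          limit_bytes = (kmem_map_bytes < physmem_bytes ?
              kmem_map_bytes : physmem_bytes) / 4;
          uma_zone_set_max(foo_zone, (int)(limit_bytes / item_size));
  }

Comments, discussion and additional input welcome.

--
Andre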