From: Konstantin Belousov
To: Andre Oppermann
Cc: alc@freebsd.org, freebsd-current@freebsd.org, Alan Cox
Subject: Re: Cleanup and untangling of kernel VM initialization
Date: Fri, 8 Mar 2013 11:16:34 +0200
Message-ID: <20130308091634.GM3794@kib.kiev.ua>
In-Reply-To: <5138C877.9060808@freebsd.org>
References: <510BC24D.2090406@freebsd.org> <510BF6E0.8070007@rice.edu> <5138C877.9060808@freebsd.org>
List-Id: Discussions about the use of FreeBSD-current

On Thu, Mar 07, 2013 at 06:03:51PM
+0100, Andre Oppermann wrote:
> On 01.02.2013 18:09, Alan Cox wrote:
> > On 02/01/2013 07:25, Andre Oppermann wrote:
> >> Rebase auto-sizing of limits on the available KVM/kmem_map instead of
> >> physical memory. Depending on the kernel and architecture
> >> configuration these two can be very different.
> >>
> >> Comments and reviews appreciated.
> >>
> >
> > I would really like to see the issues with the current auto-sizing code
> > addressed before any of the stylistic changes or en-masse conversions to
> > SYSINIT()s are considered. In particular, can we please start with the
> > patch that moves the pipe_map initialization? After that, I think that
> > we should revisit tunable_mbinit() and "maxmbufmem".
>
> OK. I'm trying to describe and explain the big picture for myself and
> other interested observers. The following text and explanations are going
> to be verbose and sometimes redundant. If something is incorrect or
> incomplete please yell, I'm not an expert in all these parts and may
> easily have missed some subtle aspects.
>
> The kernel_map serves as the container of the entire available kernel VM
> address space, including the kernel text, data and bss itself, as well as
> other bootstrapped and pre-VM allocated structures.
>
> The kernel_map should cover a reasonably large amount of address space to
> be able to serve the various kernel subsystems' demands in memory
> allocation. The CPU architecture's address range (32 or 64 bits) puts a
> hard ceiling on the total size of the kernel_map. Depending on the
> architecture the kernel_map covers a special range in the total
> addressable address range.
>
>   * VM_MIN_KERNEL_ADDRESS
>   * [KERNBASE]
>   * kernel_map [actually mapped KVM range, direct allocations]
>     * kernel text, data, bss
>     * bootstrap and statically allocated structures [pmap]
>     * virtual_avail [start of useable KVM]
>     * kmem_map [submap for (most) UMA zones and kernel malloc]
>     * exec_map [submap for temporary mapping during process exec()]
>     * pipe_map [submap for temporary buffering of data between piped processes]
>     * clean_map [submap for buffer_map and pager_map]
>       * buffer_map [submap for BIO buffers]
>       * pager_map [submap for temporary pager IO holding]
>     * memguard_map [submap for debugging of UMA and kernel malloc]
>     * ... [kernel_map direct allocations, free and unused space]
>   * kernel_map [end of kernel_map]
>   * ...
>   * virtual_end [end of possible KVM]
>   * VM_MAX_KERNEL_ADDRESS
>
> Some of the kernel_map's submaps are special in being non-pageable and
> in pre-allocating the necessary pmap structures to avoid page
> faults. The pre-allocation consumes physical memory. Thus a submap's
> pre-allocation should not be larger than a reasonably small fraction
> of available physical memory to leave enough space for other kernel
> and userspace memory demands.

Preallocation is done to ensure that calls to functions like
pmap_qenter() always succeed and never have to sleep in order to succeed.

>
> The pseudo-code for a dynamic calculation of a submap size would look
> like this:
>
>   submap.size = min(physmem.size / pmap.prealloc_max_fraction /
>                     pmap.size_per_page * page_size, kernel_map.free_size)
>
> The pmap.prealloc_max_fraction is the largest fraction of physical
> memory we allow the pre-allocated pmap structures of a single submap
> to occupy.
>
> Separate submaps are usually used to segregate certain types of memory
> usage and to have individual limits applied to them:
>
> kmem_map: tries to be as large as possible. It serves the bulk of
> all dynamically allocated kernel memory usage.
> It is the memory pool used by UMA and kernel malloc. Almost all kernel
> structures come from here: process-, thread-, file descriptors, mbufs
> and mbuf clusters, network connection control blocks, sockets, etc...
> It is not pageable. Calculation: is currently only partially done
> dynamically and the MD parts can specify particular min, max limits
> and scaling factors. It likely can be generalized, with only very
> special platforms requiring additional limits.
>
> exec_map: is used as temporary storage to set up a process's address
> space and related items. It is very small and by default contains
> only 16 pages. Calculation: (exec_map_entries * round_page(PATH_MAX
> + ARG_MAX)).
>
> pipe_map: is used to move piped data between processes. It is
> pageable memory. Calculation: min(physmem.size, kernel_map.size) /
> 64.
>
> clean_map: overarching submap to contain the buffer_map and
> pager_map. Likely no longer necessary and a leftover from earlier
> incarnations of the kernel VM.
>
> buffer_map: is used for BIO structures to perform IO between the
> kernel VM and storage media (disk). Not pageable. Calculation:
> min(physmem.size, kernel_map.size) / 4 up to 64MB and 1/10
> thereafter.
>
> pager_map: is used for pager IO to a storage media (disk). Not
> pageable. Calculation: MAXPHYS * min(max(nbuf/4, 16), 256).

It is more versatile than that. The space is used for pbufs, and pbufs
currently also serve physio, clustering, and aio needs.

>
> memguard_map: is a special debugging submap substituting parts of
> kmem_map. Normally not used.
>
> There is some competition between these maps for physical memory. One
> has to be careful to find a total balance among them wrt. static and
> dynamic physical memory use.

They mostly compete for KVA, not for the physical memory.
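
For a rough feel of the numbers, the sizing formulas quoted above can be
tried out in a small user-space sketch. It is written in Python purely for
quick arithmetic and is not the kernel's code. The constants (4KB pages,
PATH_MAX of 1024, ARG_MAX of 256KB, MAXPHYS of 128KB, 16 exec_map entries)
and all function names are assumptions for illustration, and the buffer_map
rule is read here as 1/4 of the first 64MB plus 1/10 of the remainder.

```python
# Back-of-envelope model of the quoted submap sizing formulas.
# All constants below are assumed defaults, not values read from a kernel.
PAGE_SIZE = 4096           # assumed 4KB pages
PATH_MAX = 1024            # assumed
ARG_MAX = 256 * 1024       # assumed
MAXPHYS = 128 * 1024       # assumed
MB = 1024 * 1024
GB = 1024 * MB


def round_page(nbytes):
    """Round a byte count up to a whole number of pages."""
    return (nbytes + PAGE_SIZE - 1) // PAGE_SIZE * PAGE_SIZE


def prealloc_submap_size(physmem, kernel_map_free, max_fraction, pmap_bytes_per_page):
    """Generic formula: spend at most physmem / max_fraction on pmap
    structures, convert that budget into mappable pages, cap by free KVA."""
    mappable_pages = physmem // max_fraction // pmap_bytes_per_page
    return min(mappable_pages * PAGE_SIZE, kernel_map_free)


def exec_map_size(entries=16):
    """exec_map_entries * round_page(PATH_MAX + ARG_MAX)."""
    return entries * round_page(PATH_MAX + ARG_MAX)


def pipe_map_size(physmem, kernel_map):
    """min(physmem, kernel_map) / 64."""
    return min(physmem, kernel_map) // 64


def buffer_map_size(physmem, kernel_map):
    """1/4 of the first 64MB plus 1/10 of the rest (one reading of the rule)."""
    base = min(physmem, kernel_map)
    first = min(base, 64 * MB)
    return first // 4 + (base - first) // 10


def pager_map_size(nbuf):
    """MAXPHYS * min(max(nbuf / 4, 16), 256)."""
    return MAXPHYS * min(max(nbuf // 4, 16), 256)
```

With these assumed constants, exec_map_size() comes to 16 * 266240 bytes
(about 4MB), and buffer_map_size(GB, GB) comes to 112MB. Changing the
physmem and kernel_map arguments independently shows which of the two
bounds a given map on a given configuration.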
>
> Within the submaps, especially the kmem_map, we have a number of
> dynamic UMA suballocators where we have to put a ceiling on their
> total memory usage to prevent them from consuming all physical *and/or*
> kmem_map virtual memory. This is done with UMA zone limits.

Note that architectures with direct maps do not use kmem_map for
small allocations. uma_small_alloc() uses the direct map for the VA
of the new page. kmem_map is needed when an allocation is multi-page
sized, to provide a contiguous virtual mapping.

>
> No externally exploitable single UMA zone should be able to consume
> all available physical memory. This applies for example to the
> number of processes, file descriptors, sockets, mbufs and mbuf
> clusters. These need to be limited to a reasonable and heavy work-load
> permitting amount of available physical memory. However there is going
> to be overcommit among them and not all of them can be at their limit
> at the same time. Probably none of these UMA zones should be allowed
> to occupy more than 1/2 of all available physical memory. Often
> individual UMA zone limits have to be put into context and related to
> other concurrent UMA zones. This usually means a reduced UMA zone limit
> for a particular zone. Balancing this takes a slight amount of voodoo
> magic and knowledge of common extreme work-loads to align. On the
> other hand for most of those zones allocations are permitted to fail,
> rendering an attempt at connection establishment unsuccessful. It can
> be retried later.
>
> Generic pseudo-code: UMA zone limit = min(kmem_map.size, physmem.size)
> / 4 (or other appropriate fraction).
>
> It could be that some of the kernel_map submaps are no longer
> necessary and their purpose could simply be emulated by using an
> appropriately limited UMA zone. For example the exec_map is very small
> and only used for the exec arguments. Putting this into pageable
> memory isn't very useful anymore.

I disagree.
Having the strings copied on execve() in pageable memory is good; the
default maximum size of around 260KB for the strings is quite a load on
the allocator.

>
> Also the interesting construct of the clean_map containing only
> the buffer_map and pager_map doesn't seem necessary anymore and is
> probably a remnant of an earlier incarnation of the VM.
>
> Comments, discussion and additional input welcome.
>
> --
> Andre
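
The generic UMA zone limit quoted earlier, min(kmem_map.size, physmem.size)
/ 4, is easy to sanity-check with a few lines of Python. This is only a
sketch: the function names, the 2048-byte item size (taken here as an
assumed mbuf cluster size) and the example memory sizes are illustrative
assumptions, not kernel code.

```python
MB = 1024 * 1024
GB = 1024 * MB


def uma_zone_limit(kmem_map_size, physmem_size, divisor=4):
    """Byte ceiling for a single zone: min(kmem_map, physmem) / divisor."""
    if divisor <= 0:
        raise ValueError("divisor must be positive")
    return min(kmem_map_size, physmem_size) // divisor


def max_items(limit_bytes, item_size):
    """How many fixed-size items fit under a byte ceiling."""
    return limit_bytes // item_size


def overcommit_ratio(zone_limits, pool_size):
    """Sum of per-zone ceilings relative to the pool they share.
    A value above 1.0 means the zones cannot all be full at once."""
    return sum(zone_limits) / pool_size
```

For instance, five zones that are each capped at 1/4 of the same pool give
an overcommit ratio of 1.25, which matches the remark that not all zones
can sit at their limit at the same time.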