Date: Mon, 22 Aug 2011 19:36:46 -0700 (PDT)
From: Matthew Dillon
Message-Id: <201108230236.p7N2akxB040696@apollo.backplane.com>
To: freebsd-stable@freebsd.org, Alan Cox
References: <4E4143A6.6030307@digsys.bg>
    <935F8EC2-88E0-45A3-BE8B-7210BE223BC5@mac.com>
    <4e42a0c0.e2t/9MF98O3HFjb1%perryh@pluto.rain.com>
    <4E4CCA6C.8020408@ipfw.ru>
    <20110820174147.GW17489@deviant.kiev.zoral.com.ua>
    <4E4FFAD3.4090706@rice.edu>
    <4E500014.6030800@ipfw.ru>
    <20110820191726.GY17489@deviant.kiev.zoral.com.ua>
Subject: Re: 32GB limit per swap device?
List-Id: Production branch of FreeBSD source code

The limitation was ONLY due to a *minor* 32-bit integer overflow in one or two *intermediate* calculations in the radix tree code, which I long ago fixed in DragonFly. Just find the changes in the DFly codebase and determine if they need to be applied.

The swap space radix code (which I wrote long ago) is in page-sized blocks, so you actually probably want to keep using a 32-bit integer for the block number there to keep the physical memory reservation required for the radix tree low. If you just pop the base block id up to 64 bits without adjusting the radix code to overlay a 64 bit bitmap on it you waste a lot of physical memory for the same amount of swap reservation.

This is NOT where the limitation lies. It was strictly an intermediate calculation that caused the original limitation.

With 32 bit block numbers stored in the radix tree nodes in the swap code the physical limitation is something like 1 to 4 TB of total swap. I forget exactly but it is at least 1TB. I've tested 1TB swap partitions on DragonFly with just the minor fixes to the original radix tree code.

--

Also note that I believe FreeBSD has done away with the interleaved swap. I'm not sure why, I guess geom can interleave the swap for you but I've always thought that it would be easier to just specify and add the partitions separately so one has the flexibility to swapon and swapoff the individual partitions on a live system.

Interleaving is important because you get an almost perfect performance multiplier. You don't want to just append the swap partitions after each other.

--

One last thing: The amount of wired physical memory required is still on the order of ~1MB per ~1GB of swap. A 32-bit kernel is thus still limited by available KVM, effectively limiting you to around ~32G of swap depending on various factors if you do not want to run the system out of KVM. I've run upwards of 128G of swap on 32-bit systems but it really pushed the KVM use and I would not recommend it.

A 64-bit kernel is *NOT* limited by KVM. Swap is effectively limited to ~1TB or ~2TB using the original radix code with the one or two intermediate overflow fixes applied.

The daddr_t in the original radix code can remain 32-bits (in DragonFly I typedef'd another name so I could explicitly make it 32-bits regardless of daddr_t).

Large amounts of swap space are becoming important as things like tmpfs (and swapcache in DragonFly as well) can really make use of it. Swap performance (the ability to interleave the swap space) is also important for the same reason. Interleaved swap on two SATA-III SSDs is just insane... gives you something like 800MB/sec of aggregate read bandwidth.

-Matt