From: Jason Evans <jasone@freebsd.org>
To: freebsd-current@freebsd.org
Date: Fri, 23 Dec 2005 01:14:08 -0800
Subject: New malloc ready, take 42

On 28 November, I announced a new malloc implementation that I've been working on as a replacement for the current libc malloc.
The main goal for this rewrite of malloc is to provide better scalability for multi-threaded programs on SMP systems, without degrading performance in general. The benchmarks I've run indicate that the code now meets this goal.

=== Resolved issues ===

Several people provided valuable feedback regarding architecture, (in)correct function, and performance. At this point, I think I have addressed all outstanding issues. Here's a recap:

* Hiten Pandya and others requested that I use the sys/tree.h implementation of red-black trees rather than my own. This is done.

* Jon Dama objected to leaving sbrk() support out of the malloc implementation for 32-bit architectures. I've added sbrk() support on i386 and arm.

* David Xu pointed out that the hand-rolled spinlocks did not do priority inheritance, which could lead to priority inversion deadlocks. This has been fixed by using the libc spinlocks.

* Kris Kennaway discovered that sparc64 doesn't have TLS, and Olivier Houchard contacted me about the same issue for arm. I removed the use of TLS on those architectures.

* Claus Guttesen and Devon O'Dell reported issues in kldxref and X on amd64. The kldxref problem has been fixed. The X problem is a bit trickier. I borrowed a Turion-based laptop from Rob Braun and indeed found that X wouldn't start. Eric Anholt suggested trying the xorg-server-snap port rather than xorg-server, and the problem went away. Eric explained that xorg-server uses custom ELF *.a module loading (the idea being that the exact same module could be used on different operating systems), whereas xorg-server-snap uses dlopen(3) and friends. Apparently, we are on the cusp of the transition to using the latter, so for now the workaround on amd64 is to use xorg-server-snap, with the expectation that the standard xorg-server port will soon be upgraded.
=== Benchmarking ad nauseam ===

==> Lever & Boreham benchmarks (multi-threaded)

Chuck Lever and David Boreham published "malloc() Performance in a Multithreaded Linux Environment" in the proceedings of the FREENIX track of the 2000 USENIX Annual Technical Conference. By some incredible stroke of luck, I found the source code for their benchmarks, and Kris Kennaway ran the first of them on a 14-CPU sparc64 system as well as a dual-dual opteron system. (The other two benchmarks aren't very enlightening.) It's important to keep in mind that this is a micro-benchmark that demonstrates a worst-case scenario for non-SMP-scalable malloc implementations. That said, with 5 threads, jemalloc outperformed phkmalloc by 15X on sparc64 and 80X on amd64.

==> super-smack (multi-threaded)

The SMP scalability tests were very encouraging, but Kris also ran some super-smack benchmarks on a dual-dual opteron system, and jemalloc was actually a bit slower than phkmalloc in that case (~2.7-3.7% slower, depending on jemalloc configuration). I've been obsessing over that for over a week now, and one result is that jemalloc is now more tunable: delayed coalescence cache size, number of allocation arenas, and allocation quantum size are all tunable via MALLOC_OPTIONS now.

However, changing the tuning parameters wasn't enough to make up the performance difference in the super-smack benchmarks. So, I spent some time looking at memory fragmentation. Taking a cue from Poul-Henning Kamp's malloc work, I finally wrote a program that converts ktrace output to memory usage graphs. Here's a series of graphs from various stages of the fragmentation avoidance changes I made. The test program was 'kldxref /boot/kernel' on amd64.
http://people.freebsd.org/~jasone/jemalloc/plots/kldxref.png
http://people.freebsd.org/~jasone/jemalloc/plots/kldxref13.gif
http://people.freebsd.org/~jasone/jemalloc/plots/kldxref15.gif

The interesting thing to note is that in the first two plots, the highly fragmented portion of the graph grows continuously, whereas in the last plot the fragmented portion stops growing about 1/3 of the way through the program run. (The first plot shows only the bottom half of what the other two plots show.)

This is all really cool, but alas, it didn't help the super-smack results. I was able to use the graphing program to determine that cache line sharing was potentially causing even more trouble than suspected, but we already knew that cache line sharing was an issue as a result of some of the runs that Kris did. Finally, last night I discovered that overzealous use of __inline was the likely culprit. I removed the __inline keyword from all of the functions that aren't part of the critical path for small memory allocation/deallocation, and measured a substantial performance improvement.

I don't have access to the machine that Kris ran the super-smack benchmarks on, so I set up a couple of VMware guest systems using VMware 5.5 on a 3.2 GHz P4 system (HTT enabled for the host, but not the guests). The two guests were configured very similarly, so that the only pertinent difference was that one used phkmalloc and the other used jemalloc. Before the __inline changes, jemalloc was ~2.5% slower than phkmalloc.
Here are the final results, which show jemalloc to be ~4.1% faster than phkmalloc for the benchmark:

x ss_phk
+ ss_je
+------------------------------------------------------------------------+
|xxx x  x  x  x  x  x    +  x  +  ++ ++  +  +  +                        +|
|   |_____________A__M__________|                                        |
|              |___________A___M______|                                  |
+------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x  10       1594.94       1658.64       1623.73      1619.431     21.833342
+  10       1649.06       1706.75       1693.23      1686.275     17.410025
Difference at 95.0% confidence
        66.844 +/- 18.5532
        4.12762% +/- 1.14566%
        (Student's t, pooled s = 19.7459)

The numbers reported are queries/second, so bigger is better. I used mysql 5.0.16, super-smack 1.3, libthr, and MALLOC_OPTIONS='aj', with 10 replicates of the following command for each VMware guest:

  super-smack -d mysql select-key.smack 4 10000

It's worth noting that, as near as I can tell, almost all memory allocation happens in a single mysqld thread. Presumably, a master thread is handing buffers to slaves, which process the buffers, then free them. As such, this test doesn't benefit from jemalloc's multiple arenas, so it is not a very good test of SMP scalability, at least with regard to malloc.

==> cfrac (single-threaded)

cfrac is one of many benchmark programs included in the Hoard memory allocator source distribution (see http://www.cs.umass.edu/~emery/hoard/). I ran cfrac as follows:

  cfrac 47582602774358115722167492755475367767

(Same VMware setup as for the super-smack benchmarks.)

  Wall time (seconds)
  phkmalloc 'aj': 18.75, 18.68, 18.69
  jemalloc  'aj': 17.57, 17.65, 17.56

I haven't looked at allocation traces of cfrac in quite a while, but IIRC it does a lot of small allocations of various sizes.

==> sh6bench (single-threaded)

sh6bench (see http://www.microquill.com/smartheap/shbench/bench.zip) is a quirky malloc benchmark that has been used in some of the published malloc literature, so I include it here.
sh6bench does cycles of allocating groups of objects, where the size of the objects in each group is random. Sometimes the objects are held for a while before being freed. I ran sh6bench with the following interactive runtime parameters:

  call count: 50000
  min block size: 1
  max block size: 1000

phkmalloc doesn't fare very well with its default settings, since the benchmark's memory usage fluctuates enough to cause phkmalloc to repeatedly allocate and free pages. Therefore I report numbers for multiple MALLOC_OPTIONS settings. Since the program prompts for interactive input, I report the sum of user and sys time rather than wall time.

(Same VMware setup as for the super-smack benchmarks.)

  user+sys time (seconds)
  phkmalloc 'aj':     25.66, 25.78, 22.50
  phkmalloc 'aj>>>>': 13.72, 13.77, 13.69
  jemalloc  'aj':     17.88, 17.05, 17.10

If phkmalloc's cache size is increased adequately, it beats jemalloc. jemalloc simply has to do more work when splitting and coalescing regions than phkmalloc does, and this benchmark severely stresses that aspect of jemalloc. It's perhaps worth noting that jemalloc's peak memory usage is ~22% lower than phkmalloc's, which means it does a better job of avoiding fragmentation. At least the extra algorithmic overhead gains us something.

==> gs (single-threaded)

Ghostscript is the well-known PostScript interpreter. I took the biggest PDF I have (the PostScript Language Reference, 3rd Ed.) and converted it to PostScript, then ran AFPL Ghostscript 8.53 as follows:

  gs -dBATCH -dNODISPLAY PS3.ps

As with sh6bench, memory usage fluctuated enough to cause excessive allocation/deallocation of pages with phkmalloc, so I report times for multiple MALLOC_OPTIONS settings.

(Same VMware setup as for the super-smack benchmarks.)
  Wall time (seconds)
  phkmalloc 'aj':     92.12, 92.03, 92.90
  phkmalloc 'aj>>>>': 61.11, 61.29, 61.73
  jemalloc  'aj':     62.34, 62.43, 62.20
  jemalloc  'ajQQQc': 60.85, 60.48, 60.81

Okay, the 'ajQQQc' parameterization for jemalloc is a bit silly, but if phkmalloc gets help, then so should jemalloc. =)

=== Proposed approach for inclusion in CURRENT ===

Here's a current version of all the changes that are necessary for jemalloc to go into the tree:

http://people.freebsd.org/~jasone/jemalloc/patches/jemalloc_20051222c.diff

This patch includes (roughly):

1) Fix a gensnmptree bug (missing variable initialization). I emailed the maintainer, Hartmut Brandt, about this today and am awaiting a reply.

2) Fix a kldxref bug (it assumes an allocation is adequately aligned). This needs to be committed along with (5), since the fix uses posix_memalign().

3) Move calloc() from calloc.c to malloc.c (enables better statistics gathering).

4) Add _malloc_{pre,post}fork(), for use by threading libraries.

5) Replace malloc(), calloc(), realloc(), and free(). Add posix_memalign().

I'd like to commit (3) and (4) first, so that we have a version of phkmalloc in the cvs repository that is API-compatible with libc as it will be after jemalloc is committed. This will make it easier to swap the phkmalloc code back in, should we wish to do further benchmarking or regression testing at some point in the future. Poul-Henning has agreed with this in principle, though I haven't yet supplied him with diffs for only (3) and (4).

So, how about it? Is jemalloc ready to go in now?

Thanks,
Jason