From: Scott Long <scottl@samsco.org>
To: Jason Evans
Cc: freebsd-current@freebsd.org
Subject: Re: New malloc ready, take 42
Date: Fri, 23 Dec 2005 02:31:14 -0700
Message-ID: <43ABC3E2.1030108@samsco.org>
List-Id: Discussions about the use of FreeBSD-current

Jason Evans wrote:
> On 28 November, I announced a new malloc implementation that I've been working on as a replacement for the current libc malloc. The main goal for this rewrite of malloc is to provide better scalability for multi-threaded programs on SMP systems, without degrading performance in general. The benchmarks I've run indicate that the code now meets this goal.
>
> ===
> === Resolved issues ===
> ===
>
> Several people provided valuable feedback regarding architecture, (in)correct function, and performance. At this point, I think I have addressed all outstanding issues. Here's a recap:
>
> * Hiten Pandya and others requested that I use the sys/tree.h implementation of red-black trees rather than my own. This is done. (A short sketch of that API follows this list.)
>
> * Jon Dama objected to leaving sbrk() support out of the malloc implementation for 32-bit architectures. I've added support for sbrk() on i386 and arm.
>
> * David Xu pointed out that the hand-rolled spinlocks did not do priority inheritance, which could lead to priority inversion deadlocks. This has been fixed by using the libc spinlocks.
>
> * Kris Kennaway discovered that sparc64 doesn't have TLS, and Olivier Houchard contacted me about the same issue for arm. I removed the use of TLS on those architectures.
>
> * Claus Guttesen and Devon O'Dell reported issues in kldxref and X on amd64. The kldxref problem has been fixed. The X problem is a bit trickier. I borrowed a Turion-based laptop from Rob Braun and indeed found that X wouldn't start. Eric Anholt suggested trying the xorg-server-snap port rather than xorg-server, and the problem went away. Eric explained that xorg-server uses custom ELF *.a module loading (the idea being that the exact same module could be used on different operating systems), whereas xorg-server-snap uses dlopen(3) and friends. Apparently, we are on the cusp of the transition to using the latter, so for now the workaround on amd64 is to use xorg-server-snap, with the expectation that the standard xorg-server port will soon be upgraded.
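>
> As a reminder of what that API looks like, here is a minimal, self-contained sketch of sys/tree.h red-black tree usage (illustrative only: the struct, field, and function names below are made up, and this is not jemalloc's actual tree code):
>
>     #include <stddef.h>
>     #include <sys/tree.h>
>
>     /* A node type carrying an RB_ENTRY linkage field. */
>     struct range {
>             RB_ENTRY(range) link;   /* tree linkage, managed by the macros */
>             size_t          size;   /* sort key */
>     };
>
>     /* Comparison routine: negative/zero/positive, qsort-style. */
>     static int
>     range_cmp(struct range *a, struct range *b)
>     {
>             return (a->size < b->size ? -1 : a->size > b->size ? 1 : 0);
>     }
>
>     /* Declare the head type and generate the tree functions. */
>     RB_HEAD(range_tree, range);
>     RB_PROTOTYPE(range_tree, range, link, range_cmp);
>     RB_GENERATE(range_tree, range, link, range_cmp);
>
>     static struct range_tree ranges = RB_INITIALIZER(&ranges);
>
>     /* Insert a node, then look it up again by key. */
>     static struct range *
>     range_demo(struct range *n)
>     {
>             struct range key;
>
>             RB_INSERT(range_tree, &ranges, n);
>             key.size = n->size;
>             return (RB_FIND(range_tree, &ranges, &key));
>     }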
>
> ===
> === Benchmarking ad nauseam ===
> ===
>
> ==> Lever & Boreham benchmarks (multi-threaded)
>
> Chuck Lever and David Boreham published "malloc() Performance in a Multithreaded Linux Environment" in the proceedings of the FREENIX track of the 2000 USENIX Annual Technical Conference. By some incredible stroke of luck, I found the source code for their benchmarks, and Kris Kennaway ran the first of them on a 14-CPU sparc64 system as well as a dual-dual opteron system. (The other two benchmarks aren't very enlightening.) It's important to keep in mind that this is a micro-benchmark that demonstrates a worst-case scenario for non-SMP-scalable malloc implementations. That said, with 5 threads, jemalloc outperformed phkmalloc by 15X on sparc64 and 80X on amd64.
>
> ==> super-smack (multi-threaded)
>
> The SMP scalability tests were very encouraging, but Kris also ran some super-smack benchmarks on a dual-dual opteron system, and jemalloc was actually a bit slower than phkmalloc in that case (~2.7-3.7% slower, depending on jemalloc configuration). I've been obsessing over that for over a week now, and one of the results is that jemalloc is now more tunable -- delayed coalescence cache size, number of allocation arenas, and allocation quantum size are all tunable via MALLOC_OPTIONS now.
>
> However, changing the tuning parameters wasn't enough to make up the performance difference for the super-smack benchmarks. So, I spent some time looking at memory fragmentation. Taking a cue from Poul-Henning Kamp's malloc work, I finally wrote a program that converts ktrace output to memory usage graphs. Here's a series of graphs from various stages of the fragmentation avoidance changes I made. The test program was 'kldxref /boot/kernel' on amd64.
>
> http://people.freebsd.org/~jasone/jemalloc/plots/kldxref.png
> http://people.freebsd.org/~jasone/jemalloc/plots/kldxref13.gif
> http://people.freebsd.org/~jasone/jemalloc/plots/kldxref15.gif
>
> The interesting thing to note is that in the first two plots, the highly fragmented portion of the graph continuously grows, whereas in the last plot, the fragmented portion stops growing about 1/3 of the way through the program run. (The first plot only shows the bottom half of what the other two plots show.) This is all really cool, but alas, it didn't help the super-smack results. I was able to use the graphing program to determine that cache line sharing was potentially causing even more issues than suspected, but we already knew that cache line sharing was an issue as a result of some of the runs that Kris did.
>
> Finally, last night I discovered that overzealous use of __inline is likely to have been the culprit. I removed the __inline keyword from all of the functions that aren't part of the critical path for small memory allocation/deallocation, and measured a substantial performance improvement.
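>
> To make that change concrete, here is a simplified sketch (the function names are placeholders, not the real jemalloc routines): the __inline keyword stays only on the small-allocation fast path, while everything else becomes a plain static function and the compiler decides whether to inline it:
>
>     #include <stddef.h>
>
>     /* Hot path: executed for every small malloc(), still forced inline. */
>     static __inline void *
>     small_alloc_fastpath(size_t size)
>     {
>             /* ... bin lookup and region carving would go here ... */
>             return (NULL);
>     }
>
>     /*
>      * Cold path: page-run splitting/coalescing, no longer forced inline;
>      * the compiler may still inline it if that looks profitable.
>      */
>     static void *
>     run_alloc_slowpath(size_t size)
>     {
>             /* ... page-run management would go here ... */
>             return (NULL);
>     }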
>
> I don't have access to the machine that Kris ran the super-smack benchmarks on, so I set up a couple of VMware guest systems using VMware 5.5 on a 3.2 GHz P4 system (HTT enabled for the host, but not the guests). The two guests were configured very similarly, so that the only pertinent difference was that one was using phkmalloc, and the other was using jemalloc. Before the __inline changes, jemalloc was ~2.5% slower than phkmalloc. Here are the final results, which show jemalloc to be ~4.1% faster than phkmalloc for the benchmark:
>
> x ss_phk
> + ss_je
> [ministat box plot not reproduced here; the raw numbers follow]
>     N           Min           Max        Median           Avg        Stddev
> x  10       1594.94       1658.64       1623.73      1619.431     21.833342
> +  10       1649.06       1706.75       1693.23      1686.275     17.410025
> Difference at 95.0% confidence
>         66.844 +/- 18.5532
>         4.12762% +/- 1.14566%
>         (Student's t, pooled s = 19.7459)
>
> The numbers reported are queries/second, so bigger is better.
>
> I used mysql 5.0.16, super-smack 1.3, libthr, and MALLOC_OPTIONS='aj', with 10 replicates of the following command for each VMware guest:
>
>     super-smack -d mysql select-key.smack 4 10000
>
> It's worth noting that, as near as I can tell, almost all memory allocation happens in a single mysqld thread. Presumably, a master thread is handing buffers to slaves, which process the buffers, then free them. As such, this test doesn't benefit from jemalloc's multiple arenas, so it is not a very good test of SMP scalability, at least with regard to malloc.
>
> ==> cfrac (single-threaded)
>
> cfrac is one of many benchmark programs that is included in the Hoard memory allocator source distribution (see http://www.cs.umass.edu/~emery/hoard/). I ran cfrac as follows:
>
>     cfrac 47582602774358115722167492755475367767
>
> (Same VMware setup as for the super-smack benchmarks.)
>
> Wall time (seconds)
>     phkmalloc 'aj': 18.75, 18.68, 18.69
>     jemalloc  'aj': 17.57, 17.65, 17.56
>
> I haven't looked at allocation traces of cfrac in quite a while, but IIRC, it does a lot of small allocations of various sizes.
>
> ==> sh6bench (single-threaded)
>
> sh6bench (see http://www.microquill.com/smartheap/shbench/bench.zip) is a quirky malloc benchmark that has been used in some of the published malloc literature, so I include it here. sh6bench does cycles of allocating groups of objects, where the size of the objects in each group is random. Sometimes the objects are held for a while before being freed. I ran sh6bench with the following interactive runtime parameters:
>
>     call count: 50000
>     min block size: 1
>     max block size: 1000
>
> phkmalloc doesn't fare very well with its default settings, since the benchmark's memory usage fluctuates enough to cause phkmalloc to repeatedly allocate and free pages. Therefore I report numbers for multiple MALLOC_OPTIONS settings. Since the program prompts for interactive input, I report the sum of user and sys time, rather than wall time. (Same VMware setup as for the super-smack benchmarks.)
>
> user+sys time (seconds)
>     phkmalloc 'aj':     25.66, 25.78, 22.50
>     phkmalloc 'aj>>>>': 13.72, 13.77, 13.69
>     jemalloc  'aj':     17.88, 17.05, 17.10
>
> If phkmalloc's cache size is increased adequately, it beats jemalloc. jemalloc simply has to do more work when splitting and coalescing regions than phkmalloc does, and this benchmark severely stresses that aspect of jemalloc. It's perhaps worth noting that jemalloc's peak memory usage is ~22% lower than phkmalloc's, which means that it's doing a better job of avoiding fragmentation. At least the extra algorithmic overhead gains us something.
>
> ==> gs (single-threaded)
>
> Ghostscript is the well-known PostScript interpreter. I took the biggest PDF I have (the PostScript Language Reference, 3rd Ed.) and converted it to PostScript, then ran AFPL Ghostscript 8.53 as follows:
>
>     gs -dBATCH -dNODISPLAY PS3.ps
>
> As with sh6bench, memory usage fluctuated enough to cause excessive allocation/deallocation of pages with phkmalloc, so I report times for multiple MALLOC_OPTIONS settings. (Same VMware setup as for the super-smack benchmarks.)
>
> Wall time (seconds)
>     phkmalloc 'aj':     92.12, 92.03, 92.90
>     phkmalloc 'aj>>>>': 61.11, 61.29, 61.73
>     jemalloc  'aj':     62.34, 62.43, 62.20
>     jemalloc  'ajQQQc': 60.85, 60.48, 60.81
>
> Okay, the 'ajQQQc' parameterization for jemalloc is a bit silly, but if phkmalloc gets help, then so should jemalloc. =)
>
> ===
> === Proposed approach for inclusion in CURRENT ===
> ===
>
> Here's a current version of all the changes that are necessary for jemalloc to go into the tree:
>
>     http://people.freebsd.org/~jasone/jemalloc/patches/jemalloc_20051222c.diff
>
> This patch includes (roughly):
>
> 1) Fix gensnmptree bug (missing variable initialization). I emailed the maintainer, Hartmut Brandt, about this today, and am awaiting a reply.
>
> 2) Fix kldxref bug (it assumes allocations are adequately aligned). This needs to be committed along with (5), since the fix uses posix_memalign(). (A rough example of posix_memalign() usage follows this list.)
>
> 3) Move calloc() from calloc.c to malloc.c (enables better statistics gathering).
>
> 4) Add _malloc_{pre,post}fork(), for use by threading libraries.
>
> 5) Replace malloc(), calloc(), realloc(), and free(). Add posix_memalign().
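>
> For reference, the kldxref fix boils down to this kind of change (an illustrative sketch only, not the actual patch; the alignment and size values are arbitrary examples): request explicitly aligned memory via posix_memalign() instead of assuming that malloc() happens to return a sufficiently aligned buffer.
>
>     #include <stdio.h>
>     #include <stdlib.h>
>     #include <string.h>
>
>     /*
>      * Allocate a buffer with a guaranteed alignment.  The alignment must
>      * be a power of two and a multiple of sizeof(void *); on success
>      * posix_memalign() returns 0 and stores the pointer, otherwise it
>      * returns an error number (rather than relying on errno).
>      */
>     static void *
>     alloc_aligned(size_t align, size_t size)
>     {
>             void *p;
>             int error;
>
>             error = posix_memalign(&p, align, size);
>             if (error != 0) {
>                     fprintf(stderr, "posix_memalign: %s\n", strerror(error));
>                     return (NULL);
>             }
>             return (p);
>     }
>
>     int
>     main(void)
>     {
>             void *buf;
>
>             buf = alloc_aligned(8192, 65536);       /* example values */
>             if (buf == NULL)
>                     return (1);
>             /* ... use buf where a specific alignment is required ... */
>             free(buf);
>             return (0);
>     }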
>
> I'd like to commit (3) and (4) first, so that we have a version of phkmalloc in the cvs repository that is API-compatible with libc as it will be after jemalloc is committed. This will make it easier to swap the phkmalloc code in, should we wish to do further benchmarking or regression testing at some point in the future. Poul-Henning has agreed with this in principle, though I haven't yet supplied him with diffs for only (3) and (4).
>
> So, how about it? Is jemalloc ready to go in now?
>
> Thanks,
> Jason

I think that this is overall a good plan. Two things need to be stressed, since they will quickly become FAQs. First is that phkmalloc and jemalloc won't be interchangeable at runtime the way the threading libraries are. Second is that amd64 will be stuck without working X until the 6.9/7.0 stuff gets into the tree, and people should be well aware of this.

Another thing that I worry about is complex library scenarios where you might have different versions of libc getting pulled into the same process by different library dependencies. This could turn into a big headache that is only solvable by telling people to wipe out their entire ports collection and rebuild from scratch. Does this warrant a library version bump now (as much as I really don't want to advocate this)?

Scott