From: Johan Bucht <bucht@acc.umu.se>
Date: Mon, 12 Dec 2005 04:16:04 +0100
To: Jason Evans
Cc: current@freebsd.org
Subject: Re: New libc malloc patch
Message-ID: <439CEB74.9080505@acc.umu.se>
In-Reply-To: <9FAD2B4B-C167-42D7-A8E7-BE03F4C07543@canonware.com>
References: <20051212014852.GA8775@shaka.acc.umu.se> <9FAD2B4B-C167-42D7-A8E7-BE03F4C07543@canonware.com>

Jason Evans wrote:
> Johan,
>
> Thank you for your code review!  Specific responses follow.

I wouldn't exactly call scrolling through a diff at 3 AM a review =).

> On Dec 11, 2005, at 5:48 PM, Johan Bucht wrote:
>
>> Just a few comments; I'm not sure all of them are valid, since I don't
>> have a recent enough system to apply the patch and test it, so my
>> comments are based purely on reading the patch.  I might have missed
>> something hidden between all the differing lines.  Can you put up a
>> patched malloc.c somewhere?
>
> http://www.canonware.com/~jasone/jemalloc/malloc.c

Thanks!

>> * Fork safety functions
>> Nice to have for all allocators, and something I missed having.  I
>> would like to have them regardless of whether your malloc becomes the
>> standard or not.
>
> I think that the implementation is currently fork-safe.  The threads
> libraries call _malloc_prefork() and _malloc_postfork() in order to
> avoid locking issues.

Hmm, what I meant is that the _malloc_prefork()/_malloc_postfork()
functions are independent of your malloc implementation, and I would
like to see them committed regardless, since they make life easier for
other malloc implementations.

>> * Magazines, per-arena caches
>> If I didn't miss something, you don't seem to have them; pardon if I
>> missed them in my quick reading.  They can really help scalability and
>> are implemented in the Solaris libumem allocator with great success.
>> I implemented them, and they brought scalability up quite a bit on the
>> system I tested my allocator on, as well as bringing allocation
>> latency down.
>
> Arenas do use caches for objects below a size threshold (currently
> ~1024 bytes on 32-bit systems and ~2048 bytes on 64-bit systems).
> The caching mechanism is quite different from magazines, but still
> very effective.

Ah nice, that shows how little review I did. =)  I'll look a bit more
at this, then.
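To make the fork-safety point a bit more concrete, this is roughly the
shape of hooks I have in mind.  It's only a sketch: plain pthread
mutexes stand in here for the libc-internal spinlocks, and the lock
names and arena count are made up for the example, not taken from the
patch.

#include <pthread.h>

#define NARENAS	16	/* illustrative arena count */

static pthread_mutex_t chunks_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t arena_locks[NARENAS];	/* assume initialized at startup */

void
_malloc_prefork(void)
{
	unsigned i;

	/* Acquire every allocator lock so the heap is quiescent at fork(). */
	for (i = 0; i < NARENAS; i++)
		pthread_mutex_lock(&arena_locks[i]);
	pthread_mutex_lock(&chunks_lock);
}

void
_malloc_postfork(void)
{
	unsigned i;

	/* Release everything again in both the parent and the child. */
	pthread_mutex_unlock(&chunks_lock);
	for (i = 0; i < NARENAS; i++)
		pthread_mutex_unlock(&arena_locks[i]);
}

The point is simply that prefork/postfork entry points like these are
useful to any allocator that keeps internal locks, which is why I'd
like to see them go in regardless of which malloc wins.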
>> * Saw you used a tree to look up the allocation sizes at free; this
>> differs from how I and Solaris libumem do it.  Instead I went for
>> prepending an 8/16-byte tag containing the size of the allocation,
>> for lockless size lookup.
>
> Actually, my implementation prepends a 4-byte tag that contains
> allocated/free flags and the size of the allocation.  The trees are
> only used to look up the sizes of huge allocations (> 8 MB on 32-bit
> platforms, and > 256 MB on 64-bit platforms).  Huge allocations start
> at chunk boundaries, so prepending a tag would require an entire extra
> page (and would cause potentially serious external fragmentation on
> 32-bit systems).

Isn't 8-byte alignment expected by some applications?  And how do you
know whether an allocation is huge if you don't have a tag?  Something
more to read up on, I guess. =)

>> * Bucket sizes
>> If I didn't miss something, you are using power-of-two sizes.  It
>> might be better to use smaller size classes to avoid internal
>> fragmentation, as well as to improve locking granularity.
>
> My implementation uses quantum-sized buckets (not power-of-two size
> classes).  The quantum size is either 8 or 16 bytes, depending on the
> machine architecture.

Ah, sorry; I just saw the pow stuff and assumed the buckets were powers
of two.

>> * Locking granularity
>> You use a single lock for the chunk allocation instead of a lock per
>> bucket size or something similar.  This means that you limit access
>> unless threads allocate from their private arenas.  Since you hash to
>> get an arena, there might be multiple threads accessing the same arena
>> at the same time, introducing contention.  It might be better to have
>> a lock per allocation size to somewhat improve the situation.
>
> Using a lock per allocation size requires that slower algorithms be
> used in some places (and it certainly complicates certain other
> operations).  This didn't seem like a worthwhile tradeoff to me.  By
> making the code faster and simpler, less time is spent in the malloc
> code.  Unless an app does nothing but malloc/free, I don't expect this
> to be an issue.  Even microbenchmarks that do nothing but malloc/free
> don't appear to suffer.

Yeah, I'm not sure how much of an impact it would have on your
allocator; it's just something that improved mine, since I move
magazines between the local arenas and the global one.

>> * Locking primitive
>> The biggest issue, as David Xu pointed out, is probably the locking
>> primitives.  The SPINLOCK use has a limit in the threading library and
>> makes it hard to have a lot of mutexes.  I ended up using a wrapper
>> around the umtx_lock function to get recursive mutexes, and it would
>> probably be better to extend the umtx functions to handle recursion.
>> This would probably also be appreciated by other malloc
>> implementations.  It might be interesting to implement some of the
>> ideas from the Linux futex implementation to help umtx.
>
> I have been contemplating creating a separate spinlock API that
> doesn't require the threads library to track the spinlocks across
> fork.  This would (if I understand correctly) remove the current
> static spinlock limitations.
>
> As for supporting recursive spinlocks, I doubt that the overhead would
> be acceptable in general.  If I could get rid of the need for the one
> recursive lock in malloc.c, I certainly would. =)

Yeah, I made some attempts to remove the recursive locks in my global
arena, but it made the code more complicated and didn't really gain
much.
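For what it's worth, the recursion wrapper I mentioned was nothing
fancy; roughly the pattern below, reconstructed from memory, with a
pthread mutex standing in for the umtx lock, so the names are not the
real ones.

#include <pthread.h>

struct rec_lock {
	pthread_mutex_t	mtx;	/* underlying non-recursive lock */
	pthread_t	owner;	/* only meaningful while count > 0 */
	int		count;	/* recursion depth */
};

static void
rec_lock_acquire(struct rec_lock *rl)
{
	pthread_t self = pthread_self();

	/*
	 * Only the thread that holds the lock can find its own id in
	 * rl->owner while count > 0, so this unlocked check succeeds
	 * for the owner and simply fails for everyone else.
	 */
	if (rl->count > 0 && pthread_equal(rl->owner, self)) {
		rl->count++;
		return;
	}
	pthread_mutex_lock(&rl->mtx);
	rl->owner = self;
	rl->count = 1;
}

static void
rec_lock_release(struct rec_lock *rl)
{
	if (--rl->count == 0)
		pthread_mutex_unlock(&rl->mtx);
}

Having the recursion handled in umtx itself would obviously be nicer,
which is why I'd rather see the primitive grow support for it.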
>> * Big allocations
>> It might be preferable to use mmap/munmap directly for big allocations
>> (bigger than a few pages) to avoid the overhead of trying to be smart.
>> These big allocations should be pretty rare and shouldn't be
>> pessimized much by this.  It also means some memory savings, since you
>> can free memory from big allocations directly.
>
> I don't really see the approach I'm taking as "trying to be smart", so
> much as "making it simple by using the same algorithm for almost
> everything".  Many other malloc implementations I've examined use
> three or more allocation approaches, depending on the allocation size.
> jemalloc uses only two, and huge allocations really are huge, such
> that they rarely if ever happen.  In practice, I don't think there is
> a real memory savings to be had.  Besides, madvise() calls in
> strategic locations would achieve approximately the same result.

Yeah, saving them for future use through madvise() might be better.  I
chose the route of keeping big allocations simple and avoiding any
caching or lookup for them: basically read the size and, if it's big,
munmap it directly.

>> * Ownership
>> I didn't look that much for it, so I might have missed it.  Basically,
>> what happens if you free memory allocated in another arena: do you
>> return the memory to the arena it was allocated from, or to the
>> current thread's arena?  Not returning it to the original arena leads
>> to cache line sharing.  Returning it to the original adds some
>> overhead, and it might be argued that applications doing this are not
>> scalable and should be fixed instead.
>
> Allocated objects can be associated with their parent arenas in
> constant time (by masking bits in the object addresses).  This makes
> it possible to always free memory back to the same arena it came from.

Ok, good.

>> * Standalone version
>> It would be nice to have a standalone version, to test the allocator
>> on Solaris and Linux and compare against their allocators, and for all
>> the people with access only to SMP machines running another OS.  I
>> tested my allocator on both Solaris 9 and FreeBSD 5/6, and it was
>> helpful both for testing scalability and the impact on small machines,
>> as well as for finding bugs in the implementation.  I tested my
>> allocator against the Solaris libc, Hoard and libumem on a 4-CPU
>> Solaris 9 machine (4x UltraSPARC IIIi, 1600 MHz, 16384 MB RAM) at the
>> university, and against the FreeBSD libc on my 400 MHz laptop with too
>> much beer and an almost non-working CPU fan.  Beer as in standing in
>> 10 cm of beer in my backpack =).
>> Not sure how much of the implementation is FreeBSD-dependent; the
>> RB-trees?
>
> I actually have a separate, more portable implementation of jemalloc
> that is part of a programming language runtime library (jemalloc is
> simply a FreeBSD-ized stand-alone version of that allocator).  I've
> tested it on OS X, Linux, FreeBSD, and (IIRC) Solaris.  On Linux I've
> compared against ptmalloc2, dlmalloc, and hoard.  Unfortunately, that
> version of the allocator has to make some performance sacrifices
> regarding locking, since it doesn't have access to the pthreads
> internals.  As such, it requires a lot of care to interpret
> performance results, so I haven't felt it to be worthwhile providing
> that version here.

Ok.
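Circling back to the ownership answer for a second, here is roughly how
I picture the constant-time lookup working.  The chunk size, struct
layout and names below are my own guesses for illustration, not lifted
from the patch.

#include <stdint.h>

#define CHUNK_SIZE	(2UL * 1024 * 1024)	/* assumed power-of-two chunk alignment */

struct arena;				/* opaque for the example */

struct chunk_header {
	struct arena	*owner;		/* arena this chunk belongs to */
	/* ... per-chunk bookkeeping ... */
};

/* Round an object pointer down to the base of its (aligned) chunk. */
static inline struct chunk_header *
ptr_to_chunk(void *ptr)
{
	return ((struct chunk_header *)((uintptr_t)ptr & ~(CHUNK_SIZE - 1)));
}

/* free() can then hand the region straight back to its parent arena. */
static inline struct arena *
ptr_to_arena(void *ptr)
{
	return (ptr_to_chunk(ptr)->owner);
}

That also settles my cache line sharing worry, since nothing ever gets
freed into a foreign arena.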
>> * Focus - single-threaded vs. multithreaded
>> What is the focus of the allocator; how much memory is it worth
>> wasting to get a good, scalable implementation?  This is the reason I
>> think Solaris still keeps a serial allocator in libc and makes libumem
>> accessible through LD_PRELOAD and friends.
>
> jemalloc only creates two arenas for single-threaded applications (one
> for internal memory management, and one for user requests), so it
> doesn't really sacrifice much in the case of single-threaded programs
> -- a few pages maybe.  In the case of multi-threaded programs, it
> potentially sacrifices memory compactness in order to reduce
> contention and cache line sharing.

Since I haven't tested it, I'll take your word for it.  I mostly wanted
to know where you stand on the issue of pessimizing single-threaded
apps vs. multithreaded ones, and whether it's worth keeping phkmalloc
around for some apps.

>> * Comment
>> It might be wise to fix the comment about always using mmap, since brk
>> can also be used.
>
> Thanks, done.
>
> Jason

No problem; it's always fun to have some code where you actually
understand the idea behind it and can see how it's implemented.

/Johan Bucht