From: Jason Evans <jasone@canonware.com>
Date: Sun, 11 Dec 2005 18:27:14 -0800
To: Johan Bucht
Cc: current@freebsd.org
Subject: Re: New libc malloc patch
In-Reply-To: <20051212014852.GA8775@shaka.acc.umu.se>
Message-Id: <9FAD2B4B-C167-42D7-A8E7-BE03F4C07543@canonware.com>

Johan,

Thank you for your code review! Specific responses follow.

On Dec 11, 2005, at 5:48 PM, Johan Bucht wrote:

> Just a few comments, not sure if all of them are valid since I don't
> have a recent enough system to apply the patch and test it, so my
> comments are just based on reading the patch. I might have missed
> something hidden among all the differing lines. Can you put up a
> patched malloc.c somewhere?

http://www.canonware.com/~jasone/jemalloc/malloc.c

> * Fork safety functions
> Nice to have for all allocators, and something I missed having. I
> would like to have them regardless of whether your malloc becomes
> standard or not.

I think that the implementation is currently fork-safe. The threads
libraries call _malloc_prefork() and _malloc_postfork() in order to
avoid locking issues.

> * Magazines, per-arena caches
> If I didn't miss something, you don't seem to have them; pardon me if
> I missed them in my quick reading. They can really help scalability
> and are implemented in Solaris's libumem allocator with great
> success. I implemented them, and they helped bring scalability up
> quite a bit on the system I tested my allocator on, as well as
> bringing allocation latency down.

Arenas do use caches for objects below a size threshold (currently
~1024 bytes on 32-bit systems and ~2048 bytes on 64-bit systems). The
caching mechanism is quite different from magazines, but still very
effective.

> * Saw you used a tree to look up the allocation sizes at free; this
> * differs from how I and Solaris's libumem do it. Instead I went for
> * prepending an 8/16 byte tag containing the size of the allocation,
> * for lockless size lookup.

Actually, my implementation prepends a 4 byte tag that contains
allocated/free flags and the size of the allocation. The trees are
only used to look up the sizes of huge allocations (> 8 MB on 32-bit
platforms, and > 256 MB on 64-bit platforms). Huge allocations start
at chunk boundaries, so prepending a tag would require an entire extra
page (and would cause potentially serious external fragmentation on
32-bit systems).
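As a rough illustration of how such a tag can work (this is only a
sketch with made-up names, not the actual malloc.c code, and it
assumes a 16 byte quantum; see the bucket-sizes reply below), the size
and the allocated/free flags fit in a single 32-bit word because
region sizes are multiples of the quantum, which leaves the low bits
free for flags:

#include <stdbool.h>
#include <stdint.h>

#define QUANTUM		16U			/* assumed quantum size */
#define TAG_ALLOCATED	0x1U			/* stored in the low bits */
#define TAG_SIZE_MASK	(~(uint32_t)(QUANTUM - 1))

typedef uint32_t region_tag_t;

/* Build a tag for a quantum-aligned region of 'size' bytes. */
static inline region_tag_t
tag_make(uint32_t size, bool allocated)
{
	return (size & TAG_SIZE_MASK) | (allocated ? TAG_ALLOCATED : 0);
}

static inline uint32_t
tag_size(region_tag_t tag)
{
	return tag & TAG_SIZE_MASK;
}

static inline bool
tag_allocated(region_tag_t tag)
{
	return (tag & TAG_ALLOCATED) != 0;
}

free() can read such a tag immediately before the returned pointer and
recover both pieces of information without consulting any tree.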
> * Bucket sizes
> If I didn't miss something, you are using power-of-two sizes. It
> might be better to use smaller sizes to avoid internal fragmentation,
> as well as to improve locking granularity.

My implementation uses quantum-sized buckets (not power-of-two size
classes). The quantum size is either 8 or 16 bytes, depending on the
machine architecture.

> * Locking granularity
> You use a single lock for the chunk allocation instead of a lock per
> bucket size or something similar. This means that you limit access
> unless threads allocate from their private arenas. Since you hash to
> get an arena, there might be multiple threads accessing the same
> arena at the same time, introducing contention. It might be better to
> have a lock per allocation size to somewhat improve the situation.

Using a lock per allocation size requires that slower algorithms be
used in some places (and it certainly complicates certain other
operations). This didn't seem like a worthwhile tradeoff to me. By
making the code faster and simpler, less time is spent in the malloc
code. Unless an app does nothing but malloc/free, I don't expect this
to be an issue. Even microbenchmarks that do nothing but malloc/free
don't appear to suffer.

> * Locking primitive
> The biggest issue, as David Xu pointed out, is probably the locking
> primitives. The SPINLOCK use has a limit in the threading library and
> makes it hard to have a lot of mutexes. I ended up using a wrapper
> around the umtx_lock function to get recursive mutexes, and it would
> probably be better to extend the umtx functions to handle recursion.
> This would probably also be appreciated by other malloc
> implementations. It might be interesting to implement some of the
> ideas from the Linux futex implementation to help umtx.

I have been contemplating creating a separate spinlock API that
doesn't require the threads library to track the spinlocks across
fork. This would (if I understand correctly) remove the current static
spinlock limitations.

As for supporting recursive spinlocks, I doubt that the overhead would
be acceptable in general. If I could get rid of the need for the one
recursive lock in malloc.c, I certainly would. =)

> * Big allocations
> It might be preferable to use mmap/munmap directly for big
> allocations (bigger than a few pages) to avoid the overhead of trying
> to be smart. These big allocations should be pretty few and shouldn't
> be pessimized much by this. It also means some memory savings, as you
> can directly free memory from big allocations.

I don't really see the approach I'm taking as "trying to be smart", so
much as "making it simple by using the same algorithm for almost
everything". Many other malloc implementations I've examined use three
or more allocation approaches, depending on the allocation size.
jemalloc uses only two, and huge allocations really are huge, such
that they rarely if ever happen. In practice, I don't think there is a
real memory savings to be had. Besides, madvise() calls in strategic
locations would achieve approximately the same result.

> * Ownership
> I didn't look that hard for it, so I might have missed it. Basically,
> what happens if you free memory allocated in another arena: do you
> return memory to the arena the memory was allocated from, or to the
> current thread's arena? Not returning it to the original arena leads
> to cache line sharing. Returning it to the original adds some
> overhead, and it might be argued that applications doing this might
> not be considered scalable and should be fixed instead.

Allocated objects can be associated with their parent arenas in
constant time (by masking bits in the object addresses). This makes it
possible to always free memory back to the same arena it came from.
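As a sketch of that idea (illustrative names and an illustrative 1 MB
chunk size; not the actual malloc.c code): because chunks are aligned
to the chunk size, masking off the low bits of an object's address
yields the chunk header, which can record the owning arena:

#include <stdint.h>

#define CHUNK_SIZE	(1UL << 20)	/* assumed chunk size and alignment */

struct arena;				/* opaque for this example */

struct chunk_header {
	struct arena	*arena;		/* arena this chunk was carved from */
	/* ... other per-chunk bookkeeping ... */
};

/* Constant-time lookup of the arena that owns 'ptr'. */
static inline struct arena *
owning_arena(void *ptr)
{
	uintptr_t chunk = (uintptr_t)ptr & ~((uintptr_t)CHUNK_SIZE - 1);

	return ((struct chunk_header *)chunk)->arena;
}

free() can then return the region to that arena rather than to the
calling thread's arena.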
> * Standalone version
> It would be nice to have a standalone version to test the allocator
> on Solaris and Linux, to compare against their allocators. That would
> be nice for all the people with access only to SMP machines running
> another OS. I tested my allocator on both Solaris 9 and FreeBSD 5/6,
> and it was helpful both for testing scalability and impact on small
> machines and for finding bugs in the implementation. I tested my
> allocator against Solaris libc, Hoard, and libumem on a 4-cpu Solaris
> 9 machine (4x UltraSparc IIIi, 1600 MHz, 16384 MB RAM) at the
> university, and tested my allocator vs. the FreeBSD libc on my 400
> MHz laptop with too much beer and an almost non-working cpu fan. Beer
> as in standing in 10 cm of beer in my backpack =).
> Not sure how much of the implementation is FreeBSD-dependent;
> RB-trees?

I actually have a separate, more portable implementation of jemalloc
that is part of a programming language runtime library (jemalloc is
simply a FreeBSD-ized stand-alone version of that allocator). I've
tested it on OS X, Linux, FreeBSD, and (IIRC) Solaris. On Linux I've
compared against ptmalloc2, dlmalloc, and hoard. Unfortunately, that
version of the allocator has to make some performance sacrifices
regarding locking, since it doesn't have access to the pthreads
internals. As such, it requires a lot of care to interpret performance
results, so I haven't felt it worthwhile to provide that version here.

> * Focus - Single thread vs. multithread
> What is the focus of the allocator, and how much memory is it worth
> wasting to implement a good, scalable implementation? This is the
> reason I think Solaris still keeps a serial allocator in libc and
> makes libumem accessible through LD_PRELOAD and friends.

jemalloc only creates two arenas for single-threaded applications (one
for internal memory management, and one for user requests), so it
doesn't really sacrifice much in the case of single-threaded programs
-- a few pages maybe. In the case of multi-threaded programs, it
potentially sacrifices memory compactness in order to reduce
contention and cache line sharing.

> * Comment
> It might be wise to fix the comment about always using mmap, since
> brk can be used.

Thanks, done.

Jason