From: Johan Bucht <bucht@acc.umu.se>
Date: Mon, 12 Dec 2005 04:16:04 +0100
To: Jason Evans
Cc: current@freebsd.org
Subject: Re: New libc malloc patch
Message-ID: <439CEB74.9080505@acc.umu.se>
In-Reply-To: <9FAD2B4B-C167-42D7-A8E7-BE03F4C07543@canonware.com>
References: <20051212014852.GA8775@shaka.acc.umu.se> <9FAD2B4B-C167-42D7-A8E7-BE03F4C07543@canonware.com>

Jason Evans wrote:
> Johan,
>
> Thank you for your code review!  Specific responses follow.

I wouldn't exactly call scrolling through a diff at 3 AM a review =).

> On Dec 11, 2005, at 5:48 PM, Johan Bucht wrote:
>
>> Just a few comments; I'm not sure all of them are valid, since I don't
>> have a recent enough system to apply the patch and test it, so my
>> comments are based purely on reading the patch.  I might have missed
>> something hidden between all the differing lines.  Can you put up a
>> patched malloc.c somewhere?
>
> http://www.canonware.com/~jasone/jemalloc/malloc.c

Thanks!

>> * Fork safety functions
>> Nice to have for all allocators, and something I missed having.  I
>> would like to have them regardless of whether your malloc becomes the
>> standard or not.
>
> I think that the implementation is currently fork-safe.  The threads
> libraries call _malloc_prefork() and _malloc_postfork() in order to
> avoid locking issues.

Hmm, what I meant is that the _malloc_prefork()/_malloc_postfork()
functions are independent of your malloc implementation, and I would
like to see them committed regardless, since they make life easier for
other malloc implementations.

>> * Magazines, per-arena caches
>> If I didn't miss something, you don't seem to have them; pardon if I
>> missed them in my quick reading.  They can really help scalability and
>> are implemented in the Solaris libumem allocator with great success.
>> I implemented them, and they brought scalability up quite a bit on the
>> system I tested my allocator on, as well as bringing allocation
>> latency down.
>
> Arenas do use caches for objects below a size threshold (currently
> ~1024 bytes on 32-bit systems and ~2048 bytes on 64-bit systems).
> The caching mechanism is quite different from magazines, but still
> very effective.

Ah nice, that shows how little review I did. =)  I'll look a bit more
at this, then.
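To make the fork-safety point a bit more concrete, this is roughly the
shape of hooks I have in mind.  It's only a sketch: plain pthread
mutexes stand in here for the libc-internal spinlocks, and the lock
names and arena count are made up for the example, not taken from the
patch.

#include <pthread.h>

#define NARENAS	16	/* illustrative arena count */

static pthread_mutex_t chunks_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t arena_locks[NARENAS];	/* assume initialized at startup */

void
_malloc_prefork(void)
{
	unsigned i;

	/* Acquire every allocator lock so the heap is quiescent at fork(). */
	for (i = 0; i < NARENAS; i++)
		pthread_mutex_lock(&arena_locks[i]);
	pthread_mutex_lock(&chunks_lock);
}

void
_malloc_postfork(void)
{
	unsigned i;

	/* Release everything again in both the parent and the child. */
	pthread_mutex_unlock(&chunks_lock);
	for (i = 0; i < NARENAS; i++)
		pthread_mutex_unlock(&arena_locks[i]);
}

The point is simply that prefork/postfork entry points like these are
useful to any allocator that keeps internal locks, which is why I'd
like to see them go in regardless of which malloc wins.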
>> * Saw you used a tree to look up the allocation sizes at free; this
>> differs from how I and Solaris libumem do it.  Instead I went for
>> prepending an 8/16-byte tag containing the size of the allocation,
>> for lockless size lookup.
>
> Actually, my implementation prepends a 4-byte tag that contains
> allocated/free flags and the size of the allocation.  The trees are
> only used to look up the sizes of huge allocations (> 8 MB on 32-bit
> platforms, and > 256 MB on 64-bit platforms).  Huge allocations start
> at chunk boundaries, so prepending a tag would require an entire extra
> page (and would cause potentially serious external fragmentation on
> 32-bit systems).

Isn't 8-byte alignment expected by some applications?  And how do you
know whether an allocation is huge if you don't have a tag?  Something
more to read up on, I guess. =)

>> * Bucket sizes
>> If I didn't miss something, you are using power-of-two sizes.  It
>> might be better to use smaller size classes to avoid internal
>> fragmentation, as well as to improve locking granularity.
>
> My implementation uses quantum-sized buckets (not power-of-two size
> classes).  The quantum size is either 8 or 16 bytes, depending on the
> machine architecture.

Ah, sorry; I just saw the pow stuff and assumed the buckets were powers
of two.

>> * Locking granularity
>> You use a single lock for the chunk allocation instead of a lock per
>> bucket size or something similar.  This means that you limit access
>> unless threads allocate from their private arenas.  Since you hash to
>> get an arena, there might be multiple threads accessing the same arena
>> at the same time, introducing contention.  It might be better to have
>> a lock per allocation size to somewhat improve the situation.
>
> Using a lock per allocation size requires that slower algorithms be
> used in some places (and it certainly complicates certain other
> operations).  This didn't seem like a worthwhile tradeoff to me.  By
> making the code faster and simpler, less time is spent in the malloc
> code.  Unless an app does nothing but malloc/free, I don't expect this
> to be an issue.  Even microbenchmarks that do nothing but malloc/free
> don't appear to suffer.

Yeah, I'm not sure how much of an impact it would have on your
allocator; it's just something that improved mine, since I move
magazines between the local arenas and the global one.

>> * Locking primitive
>> The biggest issue, as David Xu pointed out, is probably the locking
>> primitives.  The SPINLOCK use has a limit in the threading library and
>> makes it hard to have a lot of mutexes.  I ended up using a wrapper
>> around the umtx_lock function to get recursive mutexes, and it would
>> probably be better to extend the umtx functions to handle recursion.
>> This would probably also be appreciated by other malloc
>> implementations.  It might be interesting to implement some of the
>> ideas from the Linux futex implementation to help umtx.
>
> I have been contemplating creating a separate spinlock API that
> doesn't require the threads library to track the spinlocks across
> fork.  This would (if I understand correctly) remove the current
> static spinlock limitations.
>
> As for supporting recursive spinlocks, I doubt that the overhead would
> be acceptable in general.  If I could get rid of the need for the one
> recursive lock in malloc.c, I certainly would. =)

Yeah, I made some attempts to remove the recursive locks in my global
arena, but it made the code more complicated and didn't really gain
much.
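For what it's worth, the recursion wrapper I mentioned was nothing
fancy; roughly the pattern below, reconstructed from memory, with a
pthread mutex standing in for the umtx lock, so the names are not the
real ones.

#include <pthread.h>

struct rec_lock {
	pthread_mutex_t	mtx;	/* underlying non-recursive lock */
	pthread_t	owner;	/* only meaningful while count > 0 */
	int		count;	/* recursion depth */
};

static void
rec_lock_acquire(struct rec_lock *rl)
{
	pthread_t self = pthread_self();

	/*
	 * Only the thread that holds the lock can find its own id in
	 * rl->owner while count > 0, so this unlocked check succeeds
	 * for the owner and simply fails for everyone else.
	 */
	if (rl->count > 0 && pthread_equal(rl->owner, self)) {
		rl->count++;
		return;
	}
	pthread_mutex_lock(&rl->mtx);
	rl->owner = self;
	rl->count = 1;
}

static void
rec_lock_release(struct rec_lock *rl)
{
	if (--rl->count == 0)
		pthread_mutex_unlock(&rl->mtx);
}

Having the recursion handled in umtx itself would obviously be nicer,
which is why I'd rather see the primitive grow support for it.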
>> * Big allocations
>> It might be preferable to use mmap/munmap directly for big allocations
>> (bigger than a few pages) to avoid the overhead of trying to be smart.
>> These big allocations should be pretty rare and shouldn't be
>> pessimized much by this.  It also means some memory savings, since you
>> can free memory from big allocations directly.
>
> I don't really see the approach I'm taking as "trying to be smart", so
> much as "making it simple by using the same algorithm for almost
> everything".  Many other malloc implementations I've examined use
> three or more allocation approaches, depending on the allocation size.
> jemalloc uses only two, and huge allocations really are huge, such
> that they rarely if ever happen.  In practice, I don't think there is
> a real memory savings to be had.  Besides, madvise() calls in
> strategic locations would achieve approximately the same result.

Yeah, saving them for future use through madvise() might be better.  I
chose the route of keeping big allocations simple and avoiding any
caching or lookup for them: basically read the size and, if it's big,
munmap it directly.

>> * Ownership
>> I didn't look that much for it, so I might have missed it.  Basically,
>> what happens if you free memory allocated in another arena: do you
>> return the memory to the arena it was allocated from, or to the
>> current thread's arena?  Not returning it to the original arena leads
>> to cache line sharing.  Returning it to the original adds some
>> overhead, and it might be argued that applications doing this are not
>> scalable and should be fixed instead.
>
> Allocated objects can be associated with their parent arenas in
> constant time (by masking bits in the object addresses).  This makes
> it possible to always free memory back to the same arena it came from.

Ok, good.

>> * Standalone version
>> It would be nice to have a standalone version, to test the allocator
>> on Solaris and Linux and compare against their allocators, and for all
>> the people with access only to SMP machines running another OS.  I
>> tested my allocator on both Solaris 9 and FreeBSD 5/6, and it was
>> helpful both for testing scalability and the impact on small machines,
>> as well as for finding bugs in the implementation.  I tested my
>> allocator against the Solaris libc, Hoard and libumem on a 4-CPU
>> Solaris 9 machine (4x UltraSPARC IIIi, 1600 MHz, 16384 MB RAM) at the
>> university, and against the FreeBSD libc on my 400 MHz laptop with too
>> much beer and an almost non-working CPU fan.  Beer as in standing in
>> 10 cm of beer in my backpack =).
>> Not sure how much of the implementation is FreeBSD-dependent; the
>> RB-trees?
>
> I actually have a separate, more portable implementation of jemalloc
> that is part of a programming language runtime library (jemalloc is
> simply a FreeBSD-ized stand-alone version of that allocator).  I've
> tested it on OS X, Linux, FreeBSD, and (IIRC) Solaris.  On Linux I've
> compared against ptmalloc2, dlmalloc, and hoard.  Unfortunately, that
> version of the allocator has to make some performance sacrifices
> regarding locking, since it doesn't have access to the pthreads
> internals.  As such, it requires a lot of care to interpret
> performance results, so I haven't felt it to be worthwhile providing
> that version here.

Ok.
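Circling back to the ownership answer for a second, here is roughly how
I picture the constant-time lookup working.  The chunk size, struct
layout and names below are my own guesses for illustration, not lifted
from the patch.

#include <stdint.h>

#define CHUNK_SIZE	(2UL * 1024 * 1024)	/* assumed power-of-two chunk alignment */

struct arena;				/* opaque for the example */

struct chunk_header {
	struct arena	*owner;		/* arena this chunk belongs to */
	/* ... per-chunk bookkeeping ... */
};

/* Round an object pointer down to the base of its (aligned) chunk. */
static inline struct chunk_header *
ptr_to_chunk(void *ptr)
{
	return ((struct chunk_header *)((uintptr_t)ptr & ~(CHUNK_SIZE - 1)));
}

/* free() can then hand the region straight back to its parent arena. */
static inline struct arena *
ptr_to_arena(void *ptr)
{
	return (ptr_to_chunk(ptr)->owner);
}

That also settles my cache line sharing worry, since nothing ever gets
freed into a foreign arena.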
>> * Focus - single-threaded vs. multithreaded
>> What is the focus of the allocator; how much memory is it worth
>> wasting to get a good, scalable implementation?  This is the reason I
>> think Solaris still keeps a serial allocator in libc and makes libumem
>> accessible through LD_PRELOAD and friends.
>
> jemalloc only creates two arenas for single-threaded applications (one
> for internal memory management, and one for user requests), so it
> doesn't really sacrifice much in the case of single-threaded programs
> -- a few pages maybe.  In the case of multi-threaded programs, it
> potentially sacrifices memory compactness in order to reduce
> contention and cache line sharing.

Since I haven't tested it, I'll take your word for it.  I mostly wanted
to know where you stand on the issue of pessimizing single-threaded
apps vs. multithreaded ones, and whether it's worth keeping phkmalloc
around for some apps.

>> * Comment
>> It might be wise to fix the comment about always using mmap, since brk
>> can also be used.
>
> Thanks, done.
>
> Jason

No problem; it's always fun to have some code where you actually
understand the idea behind it and can see how it's implemented.

/Johan Bucht