From: Jason Evans <jasone@FreeBSD.org>
Date: Mon, 06 Mar 2006 22:51:50 -0800
To: Daniel O'Connor
Cc: freebsd-current@freebsd.org, Michael Nottebrock, Kris Kennaway
Subject: New jemalloc patch (was Re: KDE 3.5.0 seems much chubbier than 3.4.2)
Message-ID: <440D2D86.3000802@FreeBSD.org>

Daniel O'Connor wrote:
> On Friday 17 February 2006 12:31, Jason Evans wrote:
>> Can you tell me which programs are particularly bad, and if you are
>> using them in any particular ways that are important to reproducing
>> the high memory usage?  I don't generally use KDE, so any details you
>> provide are likely to help.
>
> Hmm, well it seems XOrg, Amarok, Kopete, Konqueror and KMail show up
> as big users.
>
> I have a largish MP3 collection (~7000 songs) loaded into Amarok, I
> have set the number of history items in Kopete to 250, Konqueror has
> about 10 tabs open, and KMail is set up to use cached-imap with my
> email accounts (162 folders, ~10000 messages).
>
> With phkmalloc I am seeing Xorg use 110M/80M (size/res), amarok uses
> 93M/73M, Kopete uses 82M/56M, Konqueror uses 81M/68M, and KMail uses
> 68M/53M.
>
> With jemalloc I saw Xorg use 213M/50M, amarok - 213M/50M, Kopete -
> 119M/7.3M, Konq - 260M/67M (guessed), and KMail - 137M/51M.

[Short summary: jemalloc fragments memory badly under some conditions.
I've re-architected jemalloc to fix this, and a patch is available for
the brave (URL below).]

I've looked into this in some detail, and have determined that KDE apps
exhibit an allocation pattern that causes jemalloc to fragment memory
rather badly.  In brief, KDE apps allocate moderate numbers of objects
across a broad spectrum of sizes, throughout their lifetimes.
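To make that pattern concrete, here is a minimal synthetic workload in
the same spirit (an illustrative sketch of the allocation pattern, not
code from any KDE app):

#include <stdlib.h>

#define	NOBJS	10000

int
main(void)
{
	void *obj[NOBJS];
	size_t i;

	/*
	 * Interleave long-lived allocations across many size classes
	 * (16..752 bytes, in 16-byte steps, cycling).
	 */
	for (i = 0; i < NOBJS; i++) {
		obj[i] = malloc(16 + (i % 47) * 16);
		if (obj[i] == NULL)
			return (1);
	}

	/*
	 * Drop every other object.  Half the space is nominally free,
	 * but the survivors are scattered across every page the
	 * allocator touched, so little of it can be coalesced or
	 * re-used for other sizes.
	 */
	for (i = 0; i < NOBJS; i += 2)
		free(obj[i]);

	return (0);
}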
Evenly mixing objects of many different sizes together tends to make
later coalescence and re-use of memory more difficult.  The following
is a summary of konqueror's allocation activity (I opened 8 tabs to
various websites, clicked around a bit, then exited):

___ Begin malloc statistics ___
Number of CPUs: 1
Number of arenas: 1
Chunk size: 2097152 (2^21)
Quantum size: 16 (2^4)
Max small size: 512
Pointer size: 4
Assertions enabled
Allocated: 4290957716, space used: 48234496
           ^^^ Stats bug
chunks:
    nchunks  highchunks   curchunks
         80          26          23
huge:
    nmalloc     ndalloc   allocated
          2           2           0
arenas[0] statistics:
  calls:
      nmalloc     npalloc     ncalloc     ndalloc     nralloc
      7512580           0     4962691    22362166     9906573
  bins:
    bin    size  nregs  run_size  nrequests  nruns  highruns  curruns
      0 T     2   1906      4096      27416      1         1        1
      1 T     4    953      4096     391123      9         9        9
      2 T     8    988      8192     283348     21        16       14
      3 Q    16   1006     16384    3384383    291       257      188
      4 Q    32   1015     32768    8044531    232       207      153
      5 Q    48   1359     65536     705179     46        45       33
      6 Q    64   1019     65536     597042     31        26       22
      7 Q    80   1634    131072     417756     12        12        9
      8 Q    96   1362    131072     370886      3         3        3
      9 Q   112   1167    131072     368970      5         5        4
     10 Q   128   1021    131072     387485      3         3        3
     11 Q   144   1818    262144     339876      1         1        1
     12 Q   160   1636    262144     334701      1         1        1
     13 Q   176   1487    262144     335299      2         2        2
     14 Q   192   1363    262144     332837      1         1        1
     15 Q   208   1258    262144     351467      2         2        2
     16 Q   224   1169    262144     345241      1         1        1
     17 Q   240   1091    262144     332729      1         1        1
     18 Q   256   1022    262144     359467      1         1        1
     19 Q   272   1926    524288     331321      1         1        1
     20 Q   288   1819    524288     330817      1         1        1
     21 Q   304   1723    524288     331117      1         1        1
     22 Q   320   1637    524288     328254      1         1        1
     23 Q   336   1559    524288     320970      1         1        1
     24 Q   352   1488    524288     308898      1         1        1
     25 Q   368   1423    524288     254787      1         1        1
     26 Q   384   1364    524288     110447      1         1        1
     27 Q   400   1310    524288     112540      1         1        1
     28 Q   416   1259    524288     109898      1         1        1
     29 Q   432   1212    524288     109902      1         1        1
     30 Q   448   1169    524288     110175      1         1        1
     31 Q   464   1129    524288     109667      1         1        1
     32 Q   480   1091    524288     110484      1         1        1
     33 Q   496   1056    524288     110610      1         1        1
     34 Q   512   1023    524288     111992      1         1        1
     35 P  1024   1023   1048576    1406248      1         1        1
     36 P  2048    511   1048576      24986      1         1        1
--- End malloc statistics ---

The 'nrequests' column indicates the total number of requests for each
size class.  The somewhat even distribution from 96..512 bytes
(actually up to ~750 bytes, as discovered in other tests) is rather
unusual, and jemalloc simply does not deal well with it.

This, in combination with some interesting ideas I came across while
writing a malloc paper for BSDCan 2006, caused me to prototype a very
different architecture than the one jemalloc currently uses.  The
prototype looked promising, so I've polished it up.  In short, the new
jemalloc has the following characteristics:

* Still uses chunks, managed by multiple arenas, so SMP scalability
  should be unchanged.

* Carves chunks into page "runs" using a power-of-two buddy allocation
  scheme.

* Allocates only one size class from each run, and uses a bitmap to
  track which regions are in use, so that regions can be tightly
  packed.

* Auto-sizes runs so that the proportion of data structure overhead is
  quite low, even for the largest sub-pagesize size classes (see the
  'run_size' column in the stats output above).

* Uses power-of-two size classes below the quantum (the quantum is 16
  bytes for most architectures), quantum-spaced size classes up to a
  threshold (512 bytes by default), and power-of-two size classes above
  the threshold (up to 1/2 page); see the sketch following this list.
  Runs are used directly for allocations larger than 1/2 page but no
  larger than 1/2 chunk.  Above 1/2 chunk, a multiple of the chunk size
  is used.
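To make the size-class rules concrete, here is a sketch of the rounding
logic described in the last bullet.  This is illustrative code under
the default parameters (quantum 16, small-size threshold 512, 4 KB
pages), not an excerpt from the patch, and the names are mine:

#include <stdio.h>
#include <stdlib.h>

#define	QUANTUM		16	/* Assumed quantum size (2^4). */
#define	SMALL_MAX	512	/* Quantum-spaced class threshold. */

/*
 * Round a sub-page request size up to its size class.  Requests larger
 * than 1/2 page are handled differently (runs directly, then chunk
 * multiples) and are not covered by this sketch.
 */
static size_t
size_class(size_t size)
{
	size_t class;

	if (size <= QUANTUM / 2) {
		/* Tiny: power-of-two classes below the quantum (2, 4, 8). */
		for (class = 2; class < size; class <<= 1)
			;
	} else if (size <= SMALL_MAX) {
		/* Small: round up to a multiple of the quantum (16..512). */
		class = (size + QUANTUM - 1) & ~(size_t)(QUANTUM - 1);
	} else {
		/* Sub-page: power-of-two classes (1024, 2048 for 4 KB pages). */
		for (class = SMALL_MAX << 1; class < size; class <<= 1)
			;
	}
	return (class);
}

int
main(void)
{
	size_t sizes[] = { 7, 100, 512, 513, 2000 };
	size_t i;

	for (i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++)
		printf("%4zu -> %4zu\n", sizes[i], size_class(sizes[i]));
	return (0);
}

For example, a 100-byte request rounds up to the 112-byte
quantum-spaced class (bin 9 in the stats above), and a 513-byte request
jumps to the 1024-byte class.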
I've been running variants of this allocator for over a week now, and
it appears to be pretty solid.  Its speed appears to be good.  Memory
usage is much improved, with one exception: small apps tend to fault in
a few more pages than before, since even a single allocation of a size
class causes a page to be faulted in.  This amounts to a bounded,
constant overhead that goes away as an application's memory usage
increases to fill those pages.

A patch is available at:

http://people.freebsd.org/~jasone/jemalloc/patches/jemalloc_20060306a.diff

See the man page in the patch for runtime configuration options.
Here's the C file:

http://people.freebsd.org/~jasone/jemalloc/patches/jemalloc_20060306a.c

I'm primarily interested in feedback regarding the following:

* Stability
* Speed
* Memory usage

Thanks,
Jason
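P.S. If you want statistics output like the above for your own test
runs, the following sketch shows one way to ask the allocator for it.
It assumes the flag letter for "print statistics at exit" is 'P', so
double-check the option letters against the man page in the patch:

#include <stdlib.h>

/* Declared by FreeBSD's libc; see malloc(3). */
extern const char *_malloc_options;

int
main(void)
{
	void *p;

	/*
	 * Assumed flag: 'P' = print malloc statistics at exit.  The
	 * patched man page is authoritative for the flag letters.
	 */
	_malloc_options = "P";

	p = malloc(100);	/* Any allocation activity will do. */
	free(p);
	return (0);		/* Statistics are dumped at process exit. */
}

Setting the MALLOC_OPTIONS environment variable to the same string
before launching a program should have the same effect, without
recompiling.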