From owner-freebsd-current@FreeBSD.ORG Thu Sep 30 16:37:20 2010 Return-Path: Delivered-To: freebsd-current@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 60FCE1065674 for ; Thu, 30 Sep 2010 16:37:19 +0000 (UTC) (envelope-from andre@freebsd.org) Received: from c00l3r.networx.ch (c00l3r.networx.ch [62.48.2.2]) by mx1.freebsd.org (Postfix) with ESMTP id D785C8FC08 for ; Thu, 30 Sep 2010 16:37:18 +0000 (UTC) Received: (qmail 20758 invoked from network); 30 Sep 2010 16:29:15 -0000 Received: from unknown (HELO [62.48.0.92]) ([62.48.0.92]) (envelope-sender ) by c00l3r.networx.ch (qmail-ldap-1.03) with SMTP for ; 30 Sep 2010 16:29:15 -0000 Message-ID: <4CA4BCD2.4070303@freebsd.org> Date: Thu, 30 Sep 2010 18:37:38 +0200 From: Andre Oppermann User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.9) Gecko/20100915 Thunderbird/3.1.4 MIME-Version: 1.0 To: freebsd-current@freebsd.org, freebsd-hackers Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: Subject: Examining the VM splay tree effectiveness X-BeenThere: freebsd-current@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussions about the use of FreeBSD-current List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 30 Sep 2010 16:37:20 -0000 Just for the kick of it I decided to take a closer look at the use of splay trees (inherited from Mach if I read the history correctly) in the FreeBSD VM system suspecting an interesting journey. The VM system has two major structures: 1) the VM map which is per process and manages its address space by tracking all regions that are allocated and their permissions. 2) the VM page table which is per VM map and provides the backing memory pages to the regions and interfaces with the pmap layer (virtual->physical). Both of these are very frequently accessed through memory allocations from userspace and page faults from the pmap layer. Their efficiency and SMP scalability is crucial for many operations beginning with fork/ exec, going through malloc/mmap and ending with free/exit of a process. Both the vmmap and page table make use of splay trees to manage the entries and to speed up lookups compared to long to traverse linked lists or more memory expensive hash tables. Some structures though do have an additional linked list to simplify ordered traversals. A splay tree is an interesting binary search tree with insertion, lookup and removal performed in O(log n) *amortized* time. With the *amortized* time being the crucial difference to other binary trees. On every access *including* lookup it rotates the tree to make the just found element the new root node. For all gory details see: http://en.wikipedia.org/wiki/Splay_tree This behavior has a few important implications: 1) the root node and constant rotation works like a cache with least recent looked up node at the top and less recently ones close to the top; 2) every lookup that doesn't match the root node, ie. a cache miss, causes a rotation of the tree to make the newly found node the new root; 3) during the rotate it has to re-arrange and touch possibly a large number of other nodes; 4) in worst case the tree may resemble a linked list. A splay tree makes no guarantee that it is balanced. For the kernel and the splay tree usage in the VM map and page table some more implications come up: a) the amortization effect/cost only balance if the majority of lookups are root node (cache) hits, otherwise the cost of rebalancing destroys any advantage and quickly turns into a large overhead. b) the overhead becomes even worse due to touching many nodes and causing a lot of CPU cache line dirtying. For processes with shared memory or threads across CPU's this overhead becomes almost excessive because the CPU's stomp on each others cached nodes and get their own caches invalidated. The penalty is a high-latency memory refetch not only for the root node lookup but every hop in the following rebalancing operation => thrashing. To quantify the positive and negative effects of the splay tree I instrumented the code and added counters to track the behavior of the splay tree in the vmmap and page table. The test case is an AMD64 kernel build to get a first overview. Other system usages may have different results depending on their fork and memory usage patters. The results are very interesting: The page table shows a *very* poor root node hit rate and an excessive amount of rotational node modifications representing almost the worst case for splay trees. http://chart.apis.google.com/chart?chs=700x300&chbh=a,23&chds=0,127484769&chd=t:16946915,16719791,48872230,131057,74858589,105206121&cht=bvg&chl=insert|remove|lookup|cachehit|rotates|rotatemodifies The vmmap shows a high root node hit rate but still a significant amount of rotational node modifications. http://chart.apis.google.com/chart?chs=700x300&chbh=a,23&chds=0,21881583&chd=t:724293,723701,20881583,19044544,3719582,4553350&cht=bvg&chl=insert|remove|lookup|cachehit|rotates|rotatemodifies From these observations I come to the conclusion that a splay tree is suboptimal for the normal case and quickly horrible even on small deviations from the normal case in the vmmap. For the page table it is always horrible. Not to mention the locking complications due to modifying the tree on lookups. Based on an examination of other binary trees I put forward the hypothesis that a Red-Black tree is a much better, if not the optimal, fit for the vmmap and page table. RB trees are balanced binary trees and O(log n) in all cases. The big advantage in this context is that lookups are pure reads and do not cause CPU cache invalidations on other CPU's and always only require a read lock without the worst case properties of the unbalanced splay tree. The high cache locality of the vmmap lookups can be used with the RB tree as well by simply adding a pointer to the least recently found node. To prevent write locking this can be done lazily. More profiling is required to make a non-speculative statement on this though. In addition a few of the additional linked lists that currently accompany the vmmap and page structures are no longer necessary as they easily can be done with standard RB tree accessors. Our standard userspace jemalloc also uses RB trees for its internal housekeeping. RB tree details: http://en.wikipedia.org/wiki/Red-black_tree I say hypothesis because I haven't measured the difference to an RB tree implementation yet. I've hacked up a crude and somewhat mechanical patch to convert the vmmap and page VM structures to use RB trees, the vmmap part is not stable yet. The page part seems to work fine though. This is what I've hacked together so far: http://people.freebsd.org/~andre/vmmap_vmpage_stats-20100930.diff http://people.freebsd.org/~andre/vmmap_vmpage_rbtree-20100930.diff The diffs are still in their early stages and do not make use of any code simplifications becoming possible by using RB trees instead of the current splay trees. Comments on the VM issue and splay vs. RB tree hypothesis welcome. -- Andre