Date: Thu, 30 Sep 2010 19:44:39 +0200
From: Andre Oppermann <andre@freebsd.org>
To: Matthew Fleming
Cc: freebsd-hackers, freebsd-current@freebsd.org
Subject: Re: Examining the VM splay tree effectiveness

On 30.09.2010 19:15, Matthew Fleming wrote:
> On Thu, Sep 30, 2010 at 9:37 AM, Andre Oppermann wrote:
>> Just for the kick of it I decided to take a closer look at the use of
>> splay trees (inherited from Mach, if I read the history correctly) in
>> the FreeBSD VM system, suspecting an interesting journey.
>>
>> The VM system has two major structures:
>> 1) the VM map, which is per process and manages its address space by
>>    tracking all regions that are allocated and their permissions;
>> 2) the VM page table, which is per VM map and provides the backing
>>    memory pages to the regions and interfaces with the pmap layer
>>    (virtual->physical).
>>
>> Both of these are very frequently accessed through memory allocations
>> from userspace and page faults from the pmap layer. Their efficiency
>> and SMP scalability are crucial for many operations, beginning with
>> fork/exec, going through malloc/mmap, and ending with free/exit of a
>> process.
>>
>> Both the vmmap and the page table use splay trees to manage their
>> entries and to speed up lookups compared to long-to-traverse linked
>> lists or more memory-expensive hash tables. Some structures do carry
>> an additional linked list to simplify ordered traversals.
>>
>> A splay tree is an interesting binary search tree with insertion,
>> lookup and removal performed in O(log n) *amortized* time, the
>> *amortized* part being the crucial difference to other binary trees.
>> On every access, *including* lookup, it rotates the tree to make the
>> just-found element the new root node. For all the gory details see:
>> http://en.wikipedia.org/wiki/Splay_tree
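
To make the rotate-on-lookup behavior concrete, here is a minimal
userland sketch using the generic SPLAY_* macros from <sys/tree.h>
(tree(3)). The names are made up for the example and this is not the
VM code, which rolls its own splay routines:

  #include <sys/tree.h>
  #include <stdio.h>
  #include <stdlib.h>

  struct node {
          SPLAY_ENTRY(node) link;
          int key;
  };

  static int
  node_cmp(struct node *a, struct node *b)
  {
          return (a->key - b->key);
  }

  SPLAY_HEAD(node_tree, node);
  SPLAY_PROTOTYPE(node_tree, node, link, node_cmp);
  SPLAY_GENERATE(node_tree, node, link, node_cmp);

  int
  main(void)
  {
          struct node_tree head = SPLAY_INITIALIZER(&head);
          struct node *n, find;
          int i;

          for (i = 0; i < 8; i++) {
                  n = malloc(sizeof(*n));
                  n->key = i;
                  SPLAY_INSERT(node_tree, &head, n);
          }

          /*
           * A successful lookup is not read-only: SPLAY_FIND()
           * splays the found node to the root and restructures
           * the tree on the way.
           */
          find.key = 3;
          n = SPLAY_FIND(node_tree, &head, &find);
          printf("found %d, new root is %d\n", n->key,
              SPLAY_ROOT(&head)->key);
          return (0);
  }

After the SPLAY_FIND() the looked-up node is the new root; that is the
cache behavior described in point 1) below, and also the reason a
lookup can never be a plain read.
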
>> This behavior has a few important implications:
>> 1) the root node and the constant rotation work like a cache, with
>>    the most recently looked-up node at the root and less recently
>>    used ones close to the top;
>> 2) every lookup that doesn't match the root node, i.e. a cache miss,
>>    causes a rotation of the tree to make the newly found node the
>>    new root;
>> 3) during the rotation it has to re-arrange and touch a possibly
>>    large number of other nodes;
>> 4) in the worst case the tree may resemble a linked list. A splay
>>    tree makes no guarantee that it is balanced.
>>
>> For the kernel and the splay tree usage in the VM map and page table
>> some more implications come up:
>> a) the amortization effect/cost only pays off if the majority of
>>    lookups are root node (cache) hits; otherwise the cost of
>>    rebalancing destroys any advantage and quickly turns into a
>>    large overhead;
>> b) the overhead becomes even worse due to touching many nodes and
>>    dirtying a lot of CPU cache lines. For processes with shared
>>    memory or threads across CPUs this overhead becomes almost
>>    excessive because the CPUs stomp on each other's cached nodes
>>    and get their own caches invalidated. The penalty is a
>>    high-latency memory refetch not only for the root node lookup
>>    but for every hop in the subsequent rebalancing operation =>
>>    thrashing.
>>
>> To quantify the positive and negative effects of the splay tree I
>> instrumented the code and added counters to track the behavior of
>> the splay trees in the vmmap and page table.
>>
>> The test case is an AMD64 kernel build to get a first overview.
>> Other system usages may have different results depending on their
>> fork and memory usage patterns.
>>
>> The results are very interesting:
>>
>> The page table shows a *very* poor root node hit rate and an
>> excessive amount of rotational node modifications, representing
>> almost the worst case for splay trees:
>>
>> http://chart.apis.google.com/chart?chs=700x300&chbh=a,23&chds=0,127484769&chd=t:16946915,16719791,48872230,131057,74858589,105206121&cht=bvg&chl=insert|remove|lookup|cachehit|rotates|rotatemodifies
>>
>> The vmmap shows a high root node hit rate but still a significant
>> amount of rotational node modifications:
>>
>> http://chart.apis.google.com/chart?chs=700x300&chbh=a,23&chds=0,21881583&chd=t:724293,723701,20881583,19044544,3719582,4553350&cht=bvg&chl=insert|remove|lookup|cachehit|rotates|rotatemodifies
>>
>> From these observations I come to the conclusion that a splay tree
>> is suboptimal for the normal case and quickly becomes horrible even
>> on small deviations from the normal case in the vmmap. For the page
>> table it is always horrible. Not to mention the locking
>> complications due to modifying the tree on lookups.
>>
>> Based on an examination of other binary trees I put forward the
>> hypothesis that a Red-Black tree is a much better, if not the
>> optimal, fit for the vmmap and page table. RB trees are balanced
>> binary trees and O(log n) in all cases. The big advantage in this
>> context is that lookups are pure reads and do not cause CPU cache
>> invalidations on other CPUs, and only ever require a read lock,
>> without the worst-case properties of the unbalanced splay tree.
>> The high cache locality of the vmmap lookups can be exploited with
>> an RB tree as well by simply adding a pointer to the most recently
>> found node. To avoid write locking this can be done lazily. More
>> profiling is required to make a non-speculative statement on this,
>> though. In addition, a few of the extra linked lists that currently
>> accompany the vmmap and page structures are no longer necessary, as
>> they can easily be replaced by the standard RB tree accessors. Our
>> standard userspace jemalloc also uses RB trees for its internal
>> housekeeping. RB tree details:
>> http://en.wikipedia.org/wiki/Red-black_tree
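
A rough sketch of what such a lazily updated lookup hint on top of an
RB tree could look like, again with the generic RB_* macros from
<sys/tree.h> and invented names rather than the real vmmap structures:

  #include <sys/tree.h>
  #include <stddef.h>

  struct entry {
          RB_ENTRY(entry) link;
          unsigned long start;            /* illustrative key */
  };

  static int
  entry_cmp(struct entry *a, struct entry *b)
  {
          if (a->start < b->start)
                  return (-1);
          return (a->start > b->start);
  }

  RB_HEAD(entry_tree, entry);
  RB_PROTOTYPE(entry_tree, entry, link, entry_cmp);
  RB_GENERATE(entry_tree, entry, link, entry_cmp);

  /*
   * RB_FIND() is a pure read and never restructures the tree, so
   * concurrent lookups only need a read lock.  The hint mimics the
   * splay tree's root-node cache, but it is only updated by the
   * caller when a write lock is held anyway, so a lookup by itself
   * never dirties shared cache lines.
   */
  static struct entry *
  entry_lookup(struct entry_tree *head, struct entry *hint,
      unsigned long start)
  {
          struct entry key;

          if (hint != NULL && hint->start == start)
                  return (hint);          /* hint hit, no tree walk */
          key.start = start;
          return (RB_FIND(entry_tree, head, &key));
  }

Whether such a hint gets anywhere near the splay tree's root-node hit
rate for the vmmap is exactly what the counters above would have to
show.
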
>> I say hypothesis because I haven't measured the difference against
>> an RB tree implementation yet. I've hacked up a crude and somewhat
>> mechanical patch to convert the vmmap and page VM structures to use
>> RB trees; the vmmap part is not stable yet. The page part seems to
>> work fine though.
>>
>> This is what I've hacked together so far:
>> http://people.freebsd.org/~andre/vmmap_vmpage_stats-20100930.diff
>> http://people.freebsd.org/~andre/vmmap_vmpage_rbtree-20100930.diff
>>
>> The diffs are still in their early stages and do not make use of any
>> of the code simplifications that become possible by using RB trees
>> instead of the current splay trees.
>>
>> Comments on the VM issue and the splay vs. RB tree hypothesis
>> welcome.
>
> Do you have actual performance numbers from a real workload?

No, not yet. I obviously would have posted those numbers as well.
What do you consider a representative real workload?

> I implemented an AA tree locally (basically a 2-3 tree that uses
> coloring to represent the 3-nodes) and the performance was much worse
> than the splay tree. I assume this is due to the overhead of keeping
> the tree balanced; I didn't instrument either the splay or the AA
> code. I suspect that the overhead of an RB tree would similarly wash
> out the benefits of O(log_2 n) time, but this is theory. The facts
> would come from measuring several different workloads and looking at
> the system/real times for them between the two solutions.

I think there are two pronounced effects to consider:
a) the algorithmic complexity;
b) the real-world implications of that complexity and of the tree
   operations on today's multi-level CPU cache hierarchies and the
   associated SMP side effects.

> We've talked internally at $work about using a B+-tree with maybe
> branching factor 5-7; whatever makes sense for the size of a cache
> line. This seems likely to be better for performance than an RB-tree
> but does require a lot more changes, since separate memory is needed
> for the tree's nodes outside the vm_page structure. There just hasn't
> been time to implement it and try it out.
>
> Unfortunately I won't have the time to experiment at $work for a few
> months on this problem.

-- 
Andre
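
As a footnote to the cache-line-sized B+-tree idea: with 64-bit keys
and pointers only about four children fit into a single 64-byte line,
so a branching factor of 5-7 implies smaller keys or a node spanning
more than one line. A hypothetical interior-node layout, with names
and sizes invented purely for illustration:

  #include <stdint.h>

  struct bpt_node;                        /* interior node or leaf */

  /*
   * One interior node per 64-byte cache line: with 8-byte keys and
   * pointers, three separator keys plus four child pointers and a
   * count are the most that fit (1 + 7 padding + 3*8 + 4*8 = 64
   * bytes on LP64).
   */
  struct bpt_inode {
          uint8_t          nkeys;         /* number of valid keys, 0..3 */
          uint64_t         key[3];        /* separator keys */
          struct bpt_node *child[4];      /* children, nodes or leaves */
  };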