Date: Thu, 30 Sep 2010 19:44:39 +0200
From: Andre Oppermann <andre@freebsd.org>
To: Matthew Fleming
Cc: freebsd-hackers, freebsd-current@freebsd.org
Subject: Re: Examining the VM splay tree effectiveness

On 30.09.2010 19:15, Matthew Fleming wrote:
> On Thu, Sep 30, 2010 at 9:37 AM, Andre Oppermann wrote:
>> Just for the kick of it I decided to take a closer look at the use of
>> splay trees (inherited from Mach, if I read the history correctly) in
>> the FreeBSD VM system, suspecting an interesting journey.
>>
>> The VM system has two major structures:
>> 1) the VM map, which is per process and manages its address space by
>>    tracking all regions that are allocated and their permissions;
>> 2) the VM page table, which is per VM map and provides the backing
>>    memory pages to the regions and interfaces with the pmap layer
>>    (virtual->physical).
>>
>> Both of these are very frequently accessed through memory allocations
>> from userspace and page faults from the pmap layer. Their efficiency
>> and SMP scalability are crucial for many operations, beginning with
>> fork/exec, going through malloc/mmap, and ending with free/exit of a
>> process.
>>
>> Both the vmmap and the page table use splay trees to manage their
>> entries and to speed up lookups compared to long-to-traverse linked
>> lists or more memory-expensive hash tables. Some structures do carry
>> an additional linked list to simplify ordered traversals.
>>
>> A splay tree is an interesting binary search tree with insertion,
>> lookup and removal performed in O(log n) *amortized* time, the
>> *amortized* part being the crucial difference to other binary trees.
>> On every access, *including* lookup, it rotates the tree to make the
>> just-found element the new root node. For all the gory details see:
>> http://en.wikipedia.org/wiki/Splay_tree
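
To make the rotate-on-lookup behavior concrete, here is a minimal
userland sketch using the generic SPLAY_* macros from <sys/tree.h>
(tree(3)). The names are made up for the example and this is not the
VM code, which rolls its own splay routines:

  #include <sys/tree.h>
  #include <stdio.h>
  #include <stdlib.h>

  struct node {
          SPLAY_ENTRY(node) link;
          int key;
  };

  static int
  node_cmp(struct node *a, struct node *b)
  {
          return (a->key - b->key);
  }

  SPLAY_HEAD(node_tree, node);
  SPLAY_PROTOTYPE(node_tree, node, link, node_cmp);
  SPLAY_GENERATE(node_tree, node, link, node_cmp);

  int
  main(void)
  {
          struct node_tree head = SPLAY_INITIALIZER(&head);
          struct node *n, find;
          int i;

          for (i = 0; i < 8; i++) {
                  n = malloc(sizeof(*n));
                  n->key = i;
                  SPLAY_INSERT(node_tree, &head, n);
          }

          /*
           * A successful lookup is not read-only: SPLAY_FIND()
           * splays the found node to the root and restructures
           * the tree on the way.
           */
          find.key = 3;
          n = SPLAY_FIND(node_tree, &head, &find);
          printf("found %d, new root is %d\n", n->key,
              SPLAY_ROOT(&head)->key);
          return (0);
  }

After the SPLAY_FIND() the looked-up node is the new root; that is the
cache behavior described in point 1) below, and also the reason a
lookup can never be a plain read.
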
>> This behavior has a few important implications:
>> 1) the root node and the constant rotation work like a cache, with
>>    the most recently looked-up node at the root and less recently
>>    used ones close to the top;
>> 2) every lookup that doesn't match the root node, i.e. a cache miss,
>>    causes a rotation of the tree to make the newly found node the
>>    new root;
>> 3) during the rotation it has to re-arrange and touch a possibly
>>    large number of other nodes;
>> 4) in the worst case the tree may resemble a linked list. A splay
>>    tree makes no guarantee that it is balanced.
>>
>> For the kernel and the splay tree usage in the VM map and page table
>> some more implications come up:
>> a) the amortization effect/cost only pays off if the majority of
>>    lookups are root node (cache) hits; otherwise the cost of
>>    rebalancing destroys any advantage and quickly turns into a
>>    large overhead;
>> b) the overhead becomes even worse due to touching many nodes and
>>    dirtying a lot of CPU cache lines. For processes with shared
>>    memory or threads across CPUs this overhead becomes almost
>>    excessive because the CPUs stomp on each other's cached nodes
>>    and get their own caches invalidated. The penalty is a
>>    high-latency memory refetch not only for the root node lookup
>>    but for every hop in the subsequent rebalancing operation =>
>>    thrashing.
>>
>> To quantify the positive and negative effects of the splay tree I
>> instrumented the code and added counters to track the behavior of
>> the splay trees in the vmmap and page table.
>>
>> The test case is an AMD64 kernel build to get a first overview.
>> Other system usages may have different results depending on their
>> fork and memory usage patterns.
>>
>> The results are very interesting:
>>
>> The page table shows a *very* poor root node hit rate and an
>> excessive amount of rotational node modifications, representing
>> almost the worst case for splay trees:
>>
>> http://chart.apis.google.com/chart?chs=700x300&chbh=a,23&chds=0,127484769&chd=t:16946915,16719791,48872230,131057,74858589,105206121&cht=bvg&chl=insert|remove|lookup|cachehit|rotates|rotatemodifies
>>
>> The vmmap shows a high root node hit rate but still a significant
>> amount of rotational node modifications:
>>
>> http://chart.apis.google.com/chart?chs=700x300&chbh=a,23&chds=0,21881583&chd=t:724293,723701,20881583,19044544,3719582,4553350&cht=bvg&chl=insert|remove|lookup|cachehit|rotates|rotatemodifies
>>
>> From these observations I come to the conclusion that a splay tree
>> is suboptimal for the normal case and quickly becomes horrible even
>> on small deviations from the normal case in the vmmap. For the page
>> table it is always horrible. Not to mention the locking
>> complications due to modifying the tree on lookups.
>>
>> Based on an examination of other binary trees I put forward the
>> hypothesis that a Red-Black tree is a much better, if not the
>> optimal, fit for the vmmap and page table. RB trees are balanced
>> binary trees and O(log n) in all cases. The big advantage in this
>> context is that lookups are pure reads and do not cause CPU cache
>> invalidations on other CPUs, and only ever require a read lock,
>> without the worst-case properties of the unbalanced splay tree.
>> The high cache locality of the vmmap lookups can be exploited with
>> an RB tree as well by simply adding a pointer to the most recently
>> found node. To avoid write locking this can be done lazily. More
>> profiling is required to make a non-speculative statement on this,
>> though. In addition, a few of the extra linked lists that currently
>> accompany the vmmap and page structures are no longer necessary, as
>> they can easily be replaced by the standard RB tree accessors. Our
>> standard userspace jemalloc also uses RB trees for its internal
>> housekeeping. RB tree details:
>> http://en.wikipedia.org/wiki/Red-black_tree
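
A rough sketch of what such a lazily updated lookup hint on top of an
RB tree could look like, again with the generic RB_* macros from
<sys/tree.h> and invented names rather than the real vmmap structures:

  #include <sys/tree.h>
  #include <stddef.h>

  struct entry {
          RB_ENTRY(entry) link;
          unsigned long start;            /* illustrative key */
  };

  static int
  entry_cmp(struct entry *a, struct entry *b)
  {
          if (a->start < b->start)
                  return (-1);
          return (a->start > b->start);
  }

  RB_HEAD(entry_tree, entry);
  RB_PROTOTYPE(entry_tree, entry, link, entry_cmp);
  RB_GENERATE(entry_tree, entry, link, entry_cmp);

  /*
   * RB_FIND() is a pure read and never restructures the tree, so
   * concurrent lookups only need a read lock.  The hint mimics the
   * splay tree's root-node cache, but it is only updated by the
   * caller when a write lock is held anyway, so a lookup by itself
   * never dirties shared cache lines.
   */
  static struct entry *
  entry_lookup(struct entry_tree *head, struct entry *hint,
      unsigned long start)
  {
          struct entry key;

          if (hint != NULL && hint->start == start)
                  return (hint);          /* hint hit, no tree walk */
          key.start = start;
          return (RB_FIND(entry_tree, head, &key));
  }

Whether such a hint gets anywhere near the splay tree's root-node hit
rate for the vmmap is exactly what the counters above would have to
show.
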
>> I say hypothesis because I haven't measured the difference against
>> an RB tree implementation yet. I've hacked up a crude and somewhat
>> mechanical patch to convert the vmmap and page VM structures to use
>> RB trees; the vmmap part is not stable yet. The page part seems to
>> work fine though.
>>
>> This is what I've hacked together so far:
>> http://people.freebsd.org/~andre/vmmap_vmpage_stats-20100930.diff
>> http://people.freebsd.org/~andre/vmmap_vmpage_rbtree-20100930.diff
>>
>> The diffs are still in their early stages and do not make use of any
>> of the code simplifications that become possible by using RB trees
>> instead of the current splay trees.
>>
>> Comments on the VM issue and the splay vs. RB tree hypothesis
>> welcome.
>
> Do you have actual performance numbers from a real workload?

No, not yet. I obviously would have posted those numbers as well.
What do you consider a representative real workload?

> I implemented an AA tree locally (basically a 2-3 tree that uses
> coloring to represent the 3-nodes) and the performance was much worse
> than the splay tree. I assume this is due to the overhead of keeping
> the tree balanced; I didn't instrument either the splay or the AA
> code. I suspect that the overhead of an RB tree would similarly wash
> out the benefits of O(log_2 n) time, but this is theory. The facts
> would come from measuring several different workloads and looking at
> the system/real times for them between the two solutions.

I think there are two pronounced effects to consider:
a) the algorithmic complexity;
b) the real-world implications of that complexity and of the tree
   operations on today's multi-level CPU cache hierarchies and the
   associated SMP side effects.

> We've talked internally at $work about using a B+-tree with maybe
> branching factor 5-7; whatever makes sense for the size of a cache
> line. This seems likely to be better for performance than an RB-tree
> but does require a lot more changes, since separate memory is needed
> for the tree's nodes outside the vm_page structure. There just hasn't
> been time to implement it and try it out.
>
> Unfortunately I won't have the time to experiment at $work for a few
> months on this problem.

-- 
Andre
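
As a footnote to the cache-line-sized B+-tree idea: with 64-bit keys
and pointers only about four children fit into a single 64-byte line,
so a branching factor of 5-7 implies smaller keys or a node spanning
more than one line. A hypothetical interior-node layout, with names
and sizes invented purely for illustration:

  #include <stdint.h>

  struct bpt_node;                        /* interior node or leaf */

  /*
   * One interior node per 64-byte cache line: with 8-byte keys and
   * pointers, three separator keys plus four child pointers and a
   * count are the most that fit (1 + 7 padding + 3*8 + 4*8 = 64
   * bytes on LP64).
   */
  struct bpt_inode {
          uint8_t          nkeys;         /* number of valid keys, 0..3 */
          uint64_t         key[3];        /* separator keys */
          struct bpt_node *child[4];      /* children, nodes or leaves */
  };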