From: Alan Cox <alc@cs.rice.edu>
Date: Sun, 04 Feb 2007 17:41:08 -0600
To: Suleiman Souhlal
Cc: alc@freebsd.org, arch@freebsd.org
Subject: Re: Reducing vm page queue mutex contention

Suleiman Souhlal wrote:

>Hello Alan,
>
>Profiling shows that the vm page queue mutex is the most contended lock in
>the kernel, maybe apart from sched_lock.  It seems that this is in part
>because this lock protects a lot of things: page queues, pv entries, page
>flags, page hold count, page wired count..
>

To start, I would concentrate on entirely eliminating the use of the page
queues lock from pmap_enter() and pmap_remove_pages().  I've inlined
specific comments below.

>I came up with a possible plan to reduce contention on this lock,
>concentrating on the amd64 pmap (although these should be applicable to
>the other architectures as well):
>
>- Make vm_page_flag_set/clear() just use atomic operations to get rid of
>  the page queues lock dependency.
>  I'm still not entirely convinced this is safe.
>

I would not do this.  Instead, I would create a separate machine-dependent
flags field that is synchronized by the same lock as the pv lists.  This
field would then be used for flags such as PG_REFERENCED and PG_WRITEABLE.
(In fact, I believe that there is wasted space due to alignment in amd64's
page structure that could be used for this.)

[snip]

>- Add a mutex pool for vm pages to protect the pv entries lists.
>  I'm currently working on this.
>  My current approach makes struct pv_entry larger because it needs to
>  store a pointer to the pte in each pv_entry.
>

While a mutex pool may ultimately be needed, I would start with a simpler
approach and then reevaluate what should be the next step.  Specifically,
I would start with a single lock for the pv lists and machine-dependent
flags.  Then, you won't need the pointer.  Moreover, the use of a mutex
pool is going to add overhead to ranged operations like
pmap_remove_pages() and pmap_remove().
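For concreteness, a rough sketch of what that might look like follows.
This is only an illustration, not existing code; the identifiers
pv_list_mtx, md_flags, and pmap_page_flag_set() are invented for the
example, and amd64's real struct md_page is laid out differently.

struct md_page {
        TAILQ_HEAD(, pv_entry) pv_list;  /* existing per-page pv list */
        int                    md_flags; /* e.g. referenced/writeable bits */
};

static struct mtx pv_list_mtx;           /* one lock for all pv lists, to
                                            start with; mtx_init() at boot */

static __inline void
pmap_page_flag_set(vm_page_t m, int flag)
{

        mtx_assert(&pv_list_mtx, MA_OWNED);
        m->md.md_flags |= flag;          /* plain store; the lock serializes it */
}

Since every path that manipulates the pv lists would already hold that
lock, setting and clearing these bits costs nothing extra, and no atomic
operations are needed for them.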
>  Another way that might be better is to move to per-object pv entries,
>  which is what Linux does.  It would greatly reduce memory usage when
>  mapping large objects in a lot of processes, although it might be slower
>  for sparsely faulted objects mapped in a large number of processes.
>  This approach would be a lot of work, which is why I'm leaning towards
>  keeping per-page pv entries.
>

I have addressed this problem in a different way.  With the superpages
support in perforce, I create a single pv entry per superpage mapping.

>- It should be possible to make vm_page->wire_count use atomic operations
>  instead of needing a lock, similarly to what I did for the hold_count.
>  This might be a bit tricky, but hopefully possible.
>  Alternatively, we could use the mutex pool described above to protect it.
>

Page table pages (ab)use their wire_count field as a reference count.  The
pmap lock is already held whenever this count is changed, or more generally
whenever the page table is changed.  So, at least for page table pages,
nothing needs to change.

>- We can change pmap_unuse_pt and free_pv_entry to just mark the pages they
>  want to free in an array allocated by the caller.
>  The caller will then free those pages after it drops the pmap lock.
>
>  For example:
>
>  struct pages_to_free {
>          vm_page_t page[MAX_PAGES];
>          int num_pages;
>  };
>
>  void pmap_remove(...)
>  {
>          struct pages_to_free pages;
>          int i;
>
>          PMAP_LOCK(pmap);
>          ...
>          pmap_unuse_pt(..., &pages);
>          ...
>          PMAP_UNLOCK(pmap);
>          vm_page_lock_queues();
>          for (i = 0; i < pages.num_pages; i++)
>                  vm_page_free(pages.page[i]);
>          vm_page_unlock_queues();
>  }
>
>  This way, pmap_remove can run mostly without the queues lock.
>

I can make vm_page_free() callable without the page queues lock for page
table and pv entry pages, i.e., pages that don't belong to a vm object.
Then, you don't need to do this.

However, there is another issue that you don't touch on here.  The page
queues lock is being used to synchronize changes to the page's dirty field
and the PTE's PG_M bit against testing for dirty pages in the
machine-independent (MI) code.  There are ordering issues in both the pmap
and MI code; for example, pmap_enter() clears PG_M on an old mapping before
setting the page's dirty field, and vm_page_dontneed() reads the page's
dirty field and then conditionally calls pmap_is_modified().

>- Once the above are done, it should be possible to make pmap_enter() run
>  mostly queue lock free by:
>  - Pre-allocating a pv chunk early in pmap_enter, if there are no free
>    ones, so that we never have to allocate new chunks in pmap_insert_entry.
>  - Dropping the page queues lock immediately after the pmap_allocpte in
>    pmap_enter.
>

Actually, you shouldn't need to acquire the page queues lock at all or move
the allocation of a pv_chunk.

Regards,
Alan
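P.S.  To make the PG_M/dirty ordering issue above concrete, here is a
simplified sketch of the two paths involved.  It is only an illustration:
the sketch_* function names are invented, the real pmap_enter() and
vm_page_dontneed() do considerably more, and the fragments assume they
live where the amd64 pmap and MI vm declarations are visible.

/* pmap side, as in pmap_enter() when it replaces an old mapping: */
static void
sketch_replace_mapping(pt_entry_t *pte, vm_page_t m)
{
        pt_entry_t oldpte;

        oldpte = pte_load_clear(pte);   /* (1) PG_M vanishes from the PTE */
        if ((oldpte & PG_M) != 0)
                vm_page_dirty(m);       /* (2) ...and only now reaches m->dirty */
}

/* MI side, loosely modeled on the test in vm_page_dontneed(): */
static int
sketch_page_looks_clean(vm_page_t m)
{

        /*
         * If this runs between (1) and (2) without a common lock, both
         * checks miss the modification and the page is wrongly treated
         * as clean.  Holding the page queues lock across both sides is
         * what currently closes that window.
         */
        return (m->dirty == 0 && !pmap_is_modified(m));
}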