From: Alan Cox <alc@cs.rice.edu>
Date: Sun, 04 Feb 2007 17:41:08 -0600
To: Suleiman Souhlal
Cc: alc@freebsd.org, arch@freebsd.org
Subject: Re: Reducing vm page queue mutex contention

Suleiman Souhlal wrote:

>Hello Alan,
>
>Profiling shows that the vm page queue mutex is the most contended lock in
>the kernel, maybe apart from sched_lock.  It seems that this is in part
>because this lock protects a lot of things: page queues, pv entries, page
>flags, page hold count, page wired count..
>

To start, I would concentrate on entirely eliminating the use of the page
queues lock from pmap_enter() and pmap_remove_pages().  I've inlined
specific comments below.

>I came up with a possible plan to reduce contention on this lock,
>concentrating on the amd64 pmap (although these should be applicable to
>the other architectures as well):
>
>- Make vm_page_flag_set/clear() just use atomic operations to get rid of
>  the page queues lock dependency.
>  I'm still not entirely convinced this is safe.
>

I would not do this.  Instead, I would create a separate machine-dependent
flags field that is synchronized by the same lock as the pv lists.  This
field would then be used for flags such as PG_REFERENCED and PG_WRITEABLE.
(In fact, I believe that there is wasted space due to alignment in amd64's
page structure that could be used for this.)

[snip]

>- Add a mutex pool for vm pages to protect the pv entries lists.
>  I'm currently working on this.
>  My current approach makes struct pv_entry larger because it needs to
>  store a pointer to the pte in each pv_entry.
>

While a mutex pool may ultimately be needed, I would start with a simpler
approach and then reevaluate what should be the next step.  Specifically,
I would start with a single lock for the pv lists and machine-dependent
flags.  Then, you won't need the pointer.  Moreover, the use of a mutex
pool is going to add overhead to ranged operations like
pmap_remove_pages() and pmap_remove().
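For concreteness, a rough sketch of what that might look like follows.
This is only an illustration, not existing code; the identifiers
pv_list_mtx, md_flags, and pmap_page_flag_set() are invented for the
example, and amd64's real struct md_page is laid out differently.

struct md_page {
        TAILQ_HEAD(, pv_entry) pv_list;  /* existing per-page pv list */
        int                    md_flags; /* e.g. referenced/writeable bits */
};

static struct mtx pv_list_mtx;           /* one lock for all pv lists, to
                                            start with; mtx_init() at boot */

static __inline void
pmap_page_flag_set(vm_page_t m, int flag)
{

        mtx_assert(&pv_list_mtx, MA_OWNED);
        m->md.md_flags |= flag;          /* plain store; the lock serializes it */
}

Since every path that manipulates the pv lists would already hold that
lock, setting and clearing these bits costs nothing extra, and no atomic
operations are needed for them.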
>  Another way that might be better is to move to per-object pv entries,
>  which is what Linux does.  It would greatly reduce memory usage when
>  mapping large objects in a lot of processes, although it might be slower
>  for sparsely faulted objects mapped in a large number of processes.
>  This approach would be a lot of work, which is why I'm leaning towards
>  keeping per-page pv entries.
>

I have addressed this problem in a different way.  With the superpages
support in perforce, I create a single pv entry per superpage mapping.

>- It should be possible to make vm_page->wire_count use atomic operations
>  instead of needing a lock, similarly to what I did for the hold_count.
>  This might be a bit tricky, but hopefully possible.
>  Alternatively, we could use the mutex pool described above to protect it.
>

Page table pages (ab)use their wire_count field as a reference count.  The
pmap lock is already held whenever this count is changed, or more generally
whenever the page table is changed.  So, at least for page table pages,
nothing needs to change.

>- We can change pmap_unuse_pt and free_pv_entry to just mark the pages they
>  want to free in an array allocated by the caller.
>  The caller will then free those pages after it drops the pmap lock.
>
>  For example:
>
>  struct pages_to_free {
>          vm_page_t page[MAX_PAGES];
>          int num_pages;
>  };
>
>  void pmap_remove(...)
>  {
>          struct pages_to_free pages;
>          int i;
>
>          PMAP_LOCK(pmap);
>          ...
>          pmap_unuse_pt(..., &pages);
>          ...
>          PMAP_UNLOCK(pmap);
>          vm_page_lock_queues();
>          for (i = 0; i < pages.num_pages; i++)
>                  vm_page_free(pages.page[i]);
>          vm_page_unlock_queues();
>  }
>
>  This way, pmap_remove can run mostly without the queues lock.
>

I can make vm_page_free() callable without the page queues lock for page
table and pv entry pages, i.e., pages that don't belong to a vm object.
Then, you don't need to do this.

However, there is another issue that you don't touch on here.  The page
queues lock is being used to synchronize changes to the page's dirty field
and the PTE's PG_M bit against testing for dirty pages in the
machine-independent (MI) code.  There are ordering issues in both the pmap
and MI code; for example, pmap_enter() clears PG_M on an old mapping before
setting the page's dirty field, and vm_page_dontneed() reads the page's
dirty field and then conditionally calls pmap_is_modified().

>- Once the above are done, it should be possible to make pmap_enter() run
>  mostly queue lock free by:
>  - Pre-allocating a pv chunk early in pmap_enter, if there are no free
>    ones, so that we never have to allocate new chunks in pmap_insert_entry.
>  - Dropping the page queues lock immediately after the pmap_allocpte in
>    pmap_enter.
>

Actually, you shouldn't need to acquire the page queues lock at all or move
the allocation of a pv_chunk.

Regards,
Alan
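P.S.  To make the PG_M/dirty ordering issue above concrete, here is a
simplified sketch of the two paths involved.  It is only an illustration:
the sketch_* function names are invented, the real pmap_enter() and
vm_page_dontneed() do considerably more, and the fragments assume they
live where the amd64 pmap and MI vm declarations are visible.

/* pmap side, as in pmap_enter() when it replaces an old mapping: */
static void
sketch_replace_mapping(pt_entry_t *pte, vm_page_t m)
{
        pt_entry_t oldpte;

        oldpte = pte_load_clear(pte);   /* (1) PG_M vanishes from the PTE */
        if ((oldpte & PG_M) != 0)
                vm_page_dirty(m);       /* (2) ...and only now reaches m->dirty */
}

/* MI side, loosely modeled on the test in vm_page_dontneed(): */
static int
sketch_page_looks_clean(vm_page_t m)
{

        /*
         * If this runs between (1) and (2) without a common lock, both
         * checks miss the modification and the page is wrongly treated
         * as clean.  Holding the page queues lock across both sides is
         * what currently closes that window.
         */
        return (m->dirty == 0 && !pmap_is_modified(m));
}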