From: Konstantin Belousov <kostikbel@gmail.com>
To: alc@freebsd.org
Cc: arch@freebsd.org
Date: Wed, 19 Dec 2012 21:28:38 +0200 (EET)
Subject: Re: Unmapped I/O
Message-ID: <20121219192838.GZ71906@kib.kiev.ua>
References: <20121219135451.GU71906@kib.kiev.ua>
List-Id: Discussion related to FreeBSD architecture <freebsd-arch@freebsd.org>
On Wed, Dec 19, 2012 at 12:58:41PM -0600, Alan Cox wrote:
> On Wed, Dec 19, 2012 at 7:54 AM, Konstantin Belousov wrote:
>
> > One of the known FreeBSD I/O path performance bottlenecks is the
> > necessity to map each I/O buffer's pages into KVA.  The problem is
> > that on multi-core machines, the mapping must flush the TLB on all
> > cores, due to the global mapping of the buffer pages into the
> > kernel.  This means that buffer creation and destruction disrupts
> > execution of all other cores to perform TLB shootdown through IPIs,
> > and the thread initiating the shootdown must wait for all other
> > cores to execute and report.
> >
> > The patch at
> > http://people.freebsd.org/~kib/misc/unmapped.4.patch
> > implements 'unmapped buffers'.  This means the ability to create a
> > VMIO struct buf which does not point to a KVA mapping of the buffer
> > pages at kernel addresses.  Since there is no mapping, the kernel
> > does not need to flush the TLB.  Unmapped buffers are marked with
> > the new B_NOTMAPPED flag, and must be requested explicitly by
> > passing the GB_NOTMAPPED flag to the buffer allocation routines.
> > If a mapped buffer is requested but an unmapped buffer already
> > exists, the buffer subsystem automatically maps the pages.
> >
> > The clustering code is also made aware of the unmapped buffers,
> > but this required a KPI change that accounts for the diff in the
> > non-UFS filesystems.
> >
> > UFS is adapted to request unmapped buffers when the kernel does
> > not need to access the content, i.e. mostly for file data.  The
> > new helper function vn_io_fault_pgmove() operates on the unmapped
> > array of pages.  It calls the new pmap method pmap_copy_pages() to
> > move the data to and from usermode.
> >
> > Besides unmapped buffers, unmapped BIOs are introduced, marked
> > with the flag BIO_NOTMAPPED.  Unmapped buffers are directly
> > translated to unmapped BIOs.
> > Geom providers may indicate acceptance of unmapped BIOs.  If a
> > provider does not handle unmapped i/o requests, geom now
> > automatically establishes a transient mapping for the i/o pages.
> >
> > Swap- and malloc-backed md(4) is changed to accept unmapped BIOs.
> > The gpart providers indicate unmapped BIO support if the
> > underlying provider can do unmapped i/o.  I also hacked ahci(4) to
> > handle unmapped i/o, but this should be changed after Jeff's
> > physbio patch is committed, to use the proper busdma interface.
> >
> > Besides, the swap pager does unmapped swapping if the swap
> > partition indicated that it can do unmapped i/o.  By Jeff's
> > request, the buffer allocation code may reserve the KVA for an
> > unmapped buffer in advance.  Unmapped page-in for the vnode pager
> > is also implemented if the filesystem supports it, but page-out is
> > not.  The page-out, as well as the vnode-backed md(4), currently
> > requires mappings, mostly due to the use of VOP_WRITE().
> >
> > As such, the patch worked in my test environment, where I used
> > ahci-attached SATA disks with gpt partitions, md(4) and UFS.  I
> > see no statistically significant difference in the buildworld -j
> > 10 times on a 4-core machine with HT.  On the other hand, when
> > doing sha1 over a 5GB file, the system time was reduced by 30%.
> >
> > Unfinished items:
> > - Integration with the physbio, to be done after physbio is
> >   committed to HEAD.
> > - The key per-architecture function needed for unmapped i/o is
> >   pmap_copy_pages().  I implemented it for amd64 and i386 right
> >   now; it shall be done for all other architectures.
> > - The sizing of the submap used for transient mapping of the BIOs
> >   is naive.  It should be adjusted, esp. for KVA-lean
> >   architectures.
> > - Conversion of the other filesystems.  Low priority.
> >
> > I am interested in reviews, tests and suggestions.
> > Note that this only works now for md(4) and ahci(4); for other
> > drivers the patched kernel should fall back to mapped i/o.
>
> Here are a couple of things for you to think about:
>
> 1. A while back, I developed the patch at
> http://www.cs.rice.edu/~alc/buf_maps5.patch as an experiment in
> trying to reduce the number of TLB shootdowns by the buffer map.
> The idea is simple: replace the calls to pmap_q{enter,remove}() with
> calls to a new machine-dependent function that opportunistically
> sets the buffer's kernel virtual address to the direct map for
> physically contiguous pages.  However, if the pages are not
> physically contiguous, it calls pmap_qenter() with the kernel
> virtual address from the buffer map.
>
> This eliminated about half of the TLB shootdowns for a buildworld,
> because there is a decent amount of physical contiguity that occurs
> by "accident".  Using a buddy allocator for physical page allocation
> tends to promote this contiguity.  However, in a few places, it
> occurs by explicit action, e.g., mapped files, including large
> executables, using superpage reservations.
>
> So, how does this fit with what you've done?  You might think of
> using what I describe above as a kind of "fast path".  As you can
> see from the patch, it's very simple and non-intrusive.  If the
> pages aren't physically contiguous, then instead of using
> pmap_qenter(), you fall back to whatever approach for creating
> ephemeral mappings is appropriate to a given architecture.

I remember this.  I did not measure the change in the number of IPIs
issued during the buildworld, but I do account for the mapped/unmapped
buffer space in the patch.  For the buildworld load, mapped buffers
are 5-10% of all buffers, which coincides with the intuitive size of
the metadata for the sources.  Since unmapped buffers eliminate IPIs
at creation and reuse, I can safely guess that the IPI reduction is of
a comparable magnitude.
The pmap_map_buf() patch is orthogonal to the work I did, and it
should nicely reduce the overhead of metadata buffer handling.  I can
finish it, if you want.  I do not think that it should be added to the
already large patch; instead, it could be done and committed
separately.

> 2. As for managing the ephemeral mappings on machines that don't
> support a direct map, I would suggest an approach that is loosely
> inspired by copying garbage collection (or the segment cleaners in
> log-structured file systems).  Roughly, you manage the buffer map as
> a few spaces (or segments).  When you create a new mapping in one of
> these spaces (or segments), you simply install the PTEs.  When you
> decide to "garbage collect" a space (or spaces), then you perform a
> global TLB flush.  Specifically, you do something like toggling the
> bit in the cr4 register that enables/disables support for the PG_G
> bit.  If the spaces are sufficiently large, then the number of such
> global TLB flushes should be quite low.  Every space would have an
> epoch number (or flush number).  In the buffer, you would record the
> epoch number alongside the kernel virtual address.  On access to the
> buffer, if the epoch number was too old, then you have to recreate
> the buffer's mapping in a new space.

Could you, please, describe the idea in more detail?  For which
mappings should the described mechanism be used?  Do you mean the
pmap_copy_pages() implementation, or the fallback mappings for BIOs?

Note that the pmap_copy_pages() implementation on i386 is shamelessly
stolen from pmap_copy_page() and uses the per-cpu ephemeral mapping
for copying.  For BIOs, this might be used, but I am also quite
satisfied with the submap and pmap_qenter().