From owner-freebsd-arch@FreeBSD.ORG Fri Dec 28 23:00:52 2012 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 20F08C26 for ; Fri, 28 Dec 2012 23:00:52 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from kib.kiev.ua (kib.kiev.ua [IPv6:2001:470:d5e7:1::1]) by mx1.freebsd.org (Postfix) with ESMTP id 3D38A8FC15 for ; Fri, 28 Dec 2012 23:00:51 +0000 (UTC) Received: from tom.home (kostik@localhost [127.0.0.1]) by kib.kiev.ua (8.14.5/8.14.5) with ESMTP id qBSN0faF010190; Sat, 29 Dec 2012 01:00:41 +0200 (EET) (envelope-from kostikbel@gmail.com) DKIM-Filter: OpenDKIM Filter v2.7.3 kib.kiev.ua qBSN0faF010190 Received: (from kostik@localhost) by tom.home (8.14.5/8.14.5/Submit) id qBSN0fp4010189; Sat, 29 Dec 2012 01:00:41 +0200 (EET) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: tom.home: kostik set sender to kostikbel@gmail.com using -f Date: Sat, 29 Dec 2012 01:00:41 +0200 From: Konstantin Belousov To: Jeff Roberson Subject: Re: Unmapped I/O Message-ID: <20121228230041.GY82219@kib.kiev.ua> References: <20121219135451.GU71906@kib.kiev.ua> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="eUqGrSc0O7wKBRnC" Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) X-Spam-Status: No, score=-2.0 required=5.0 tests=ALL_TRUSTED,BAYES_00, DKIM_ADSP_CUSTOM_MED,FREEMAIL_FROM,NML_ADSP_CUSTOM_MED autolearn=no version=3.3.2 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on tom.home Cc: arch@freebsd.org X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 28 Dec 2012 23:00:52 -0000 --eUqGrSc0O7wKBRnC Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Fri, Dec 28, 2012 at 12:36:57PM -1000, Jeff Roberson wrote: > On Wed, 19 Dec 2012, Konstantin Belousov wrote: >=20 > > One of the known FreeBSD I/O path performance bootleneck is the > > neccessity to map each I/O buffer pages into KVA. The problem is that > > on the multi-core machines, the mapping must flush TLB on all cores, > > due to the global mapping of the buffer pages into the kernel. This > > means that buffer creation and destruction disrupts execution of all > > other cores to perform TLB shootdown through IPI, and the thread > > initiating the shootdown must wait for all other cores to execute and > > report. > > > > The patch at > > http://people.freebsd.org/~kib/misc/unmapped.4.patch > > implements the 'unmapped buffers'. It means an ability to create the > > VMIO struct buf, which does not point to the KVA mapping the buffer > > pages to the kernel addresses. Since there is no mapping, kernel does > > not need to clear TLB. The unmapped buffers are marked with the new > > B_NOTMAPPED flag, and should be requested explicitely using the > > GB_NOTMAPPED flag to the buffer allocation routines. If the mapped > > buffer is requested but unmapped buffer already exists, the buffer > > subsystem automatically maps the pages. > > > > The clustering code is also made aware of the not-mapped buffers, but > > this required the KPI change that accounts for the diff in the non-UFS > > filesystems. > > > > UFS is adopted to request not mapped buffers when kernel does not need > > to access the content, i.e. mostly for the file data. New helper > > function vn_io_fault_pgmove() operates on the unmapped array of pages. > > It calls new pmap method pmap_copy_pages() to do the data move to and > > from usermode. > > > > Besides not mapped buffers, not mapped BIOs are introduced, marked > > with the flag BIO_NOTMAPPED. Unmapped buffers are directly translated > > to unmapped BIOs. Geom providers may indicate an acceptance of the > > unmapped BIOs. If provider does not handle unmapped i/o requests, > > geom now automatically establishes transient mapping for the i/o > > pages. > > > > Swap- and malloc-backed md(4) is changed to accept unmapped BIOs. The > > gpart providers indicate the unmapped BIOs support if the underlying > > provider can do unmapped i/o. I also hacked ahci(4) to handle > > unmapped i/o, but this should be changed after the Jeff' physbio patch > > is committed, to use proper busdma interface. > > > > Besides, the swap pager does unmapped swapping if the swap partition > > indicated that it can do unmapped i/o. By Jeff request, a buffer > > allocation code may reserve the KVA for unmapped buffer in advance. > > The unmapped page-in for the vnode pager is also implemented if > > filesystem supports it, but the page out is not. The page-out, as well > > as the vnode-backed md(4), currently require mappings, mostly due to > > the use of VOP_WRITE(). > > > > As such, the patch worked in my test environment, where I used > > ahci-attached SATA disks with gpt partitions, md(4) and UFS. I see no > > statistically significant difference in the buildworld -j 10 times on > > the 4-core machine with HT. On the other hand, when doing sha1 over > > the 5GB file, the system time was reduced by 30%. > > > > Unfinished items: > > - Integration with the physbio, will be done after physbio is > > committed to HEAD. > > - The key per-architecture function needed for the unmapped i/o is the > > pmap_copy_pages(). I implemented it for amd64 and i386 right now, it > > shall be done for all other architectures. > > - The sizing of the submap used for transient mapping of the BIOs is > > naive. Should be adjusted, esp. for KVA-lean architectures. > > - Conversion of the other filesystems. Low priority. > > > > I am interested in reviews, tests and suggestions. Note that this > > only works now for md(4) and ahci(4), for other drivers the patched > > kernel should fall back to the mapped i/o. > > > > sys/amd64/amd64/pmap.c | 24 +++ > > sys/cam/ata/ata_da.c | 5 +- > > sys/cam/cam_ccb.h | 30 ++++ > > sys/dev/ahci/ahci.c | 53 +++++- > > sys/dev/md/md.c | 255 ++++++++++++++++++++++++----- > > sys/fs/cd9660/cd9660_vnops.c | 2 +- > > sys/fs/ext2fs/ext2_balloc.c | 2 +- > > sys/fs/ext2fs/ext2_vnops.c | 9 +- > > sys/fs/msdosfs/msdosfs_vnops.c | 4 +- > > sys/fs/udf/udf_vnops.c | 5 +- > > sys/geom/geom.h | 1 + > > sys/geom/geom_disk.c | 2 + > > sys/geom/geom_disk.h | 1 + > > sys/geom/geom_io.c | 44 ++++- > > sys/geom/geom_vfs.c | 10 +- > > sys/geom/part/g_part.c | 1 + > > sys/i386/i386/pmap.c | 42 +++++ > > sys/kern/vfs_bio.c | 356 +++++++++++++++++++++++++++++++++-= ------- > > sys/kern/vfs_cluster.c | 118 +++++++------- > > sys/kern/vfs_vnops.c | 39 +++++ > > sys/sys/bio.h | 7 + > > sys/sys/buf.h | 22 ++- > > sys/sys/mount.h | 1 + > > sys/sys/vnode.h | 2 + > > sys/ufs/ffs/ffs_alloc.c | 10 +- > > sys/ufs/ffs/ffs_balloc.c | 58 ++++--- > > sys/ufs/ffs/ffs_vfsops.c | 3 +- > > sys/ufs/ffs/ffs_vnops.c | 35 ++-- > > sys/ufs/ufs/ufs_extern.h | 1 + > > sys/vm/pmap.h | 2 + > > sys/vm/swap_pager.c | 43 +++-- > > sys/vm/swap_pager.h | 1 + > > sys/vm/vm.h | 2 + > > sys/vm/vm_init.c | 6 +- > > sys/vm/vm_kern.c | 9 +- > > sys/vm/vnode_pager.c | 30 +++- > > 36 files changed, 989 insertions(+), 246 deletions(-) > > > > >=20 > A few comments: >=20 > 1) If the BIO is mapped at all you have to pass the virtual address to= =20 > busdma. We can not lose the fact that it is mapped somewhere or virtual= =20 > cache architectures will fail. I think when we integrate with physbio we= =20 > can do this properly by implementing a load routine on the bio. This is weird. How the buffer cache works for such architectures then ? VMIO buffer pages could be mapped by kernel and any usermode. Anyway, I believe I never pass BIO_NOTMAPPED down if the kernel mapping for the buffer exists. I might document it somewhere. Would you prefer to have both mapping address and vm_page array passed down, if KVA is available ? >=20 > 2) Why does mdstart_swap() need a cpu_flush_dcache? Shouldn't sf bufs o= n=20 > the appropriate architecture do the right flushing when they are unmapped? I believe it was added by Marcel. sfbuf code cannot not know if the [d]cache should be flushed, since there is no indication from the caller on the kind of access to the mapping, ro or rw. =20 >=20 > 3) I find the NOTMAPPED negative flag awkward grammatically. UNMAPPED= =20 > seems more natural. Or a positive MAPPED flag would be better. Minor=20 > concern, bikeshedding, etc. So you prefers BIO/B_UNMAPPED ? I will change this. >=20 > 4) It would be better to have some wrapper functions around the bio=20 > transient map and or sf buf handling. I will need it to map unmapped cam= =20 > ccbs in device drivers. We need to come to some agreement on this API.= =20 > There should be a fast page-by-page version and a potentially blocking=20 > all-at-once linear version. No, I do not think this is needed. I thought about this, and decided to handle it centralized in g_down instead to require drivers code to be changed there. The driver should indicate the acceptance of the unmapped BIOs. If it does not accept them, the patch already contains a code in the g_down path to create the transient mapping. The PIO-mode ATA drivers should not set the flag on the disk. There might be some need to be able to dynamically clear the flag when the channel mode is switched to PIO. In fact, I do not see why not to require the always mapped BIOs for legacy ATA. See DISKFLAG_NOTMAPPED_BIO. >=20 > 5) Should the transient bio space not come from the buffer_map? We don'= t=20 > have more KVA on 32bit architectures. Or did that come from pbufs? We= =20 > could consolidate here some but there are a lot of potential deadlock=20 > issues. No, exactly because it is impossible to request the buffer_map KVA defragmentation from the g_down. The worst which could happen with the current code is the pauses during intensive i/o when transient map is exhausted or fragmented. The g_down is paused until enough in-flight transactions are completed. Peter has tested it under the load where a lot of the parallel i/o were issued and transient map was constantly exhausted, for the ad(4) driver requiring fallback mappings. I added the counters to watch the situation. >=20 > 6) Thank you for adding the KVAALLOC flag. Isilon needs this internally. Please note that I did not tested this yet :/. >=20 > 7) All of that code that exploded into getblk() should be refactored int= o=20 > some support functions. It's hard to read and getblk() is too big=20 > already. Agreed. >=20 > All in all it looks like we have the right pieces coming together. Just= =20 > needs a little refactoring and then a lot of test. I'm almost ready to= =20 > commit the first phase of physbio. It doesn't yet have the code necessar= y=20 > to temporarily map io for things like ATA PIO. I'm hoping that you'll=20 It is there in the central place. > provide the functions to do that. It does abstract out most of the=20 > details of the memory formats so we can change them at all. And adds=20 > support to busdma for physical addresses so bounce pages, alignment,=20 > sizes, and boundaries work. >=20 > Thanks, > Jeff --eUqGrSc0O7wKBRnC Content-Type: application/pgp-signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.19 (FreeBSD) iQIcBAEBAgAGBQJQ3iSYAAoJEJDCuSvBvK1BRZ0P/2TQph/5zzHD45+UCtQucthS IdJIr0iN6IpLj8++Xs3iR4BAHpFWhKmmqlhxmomOUQSLtHqL3kmqBbdxhNltof/E +OszxMkCqcSF4utLhiYKo/tTePyNPZ0yKaMfX7FpUX/8sa6nRxeMmsub6BhPmLiu lu76UlpIHKJRqgHSOuOq1SNVcSFdQGu6m7zZ6SP2ZuhE7JfJvmJyvCKZqEzwo2H/ yUPRTTarNJsOLEMoQOaC0GtPRAWtjNICEwoCmuqU4QdEVy/PKnVKX+W4qDKCZnXK O1jmLpn2DrVWRpq2oOURnVcC4a+7XG7ilx5IymNEDZr6f+KLrrbYOhTNiYNw0Kw/ ZVRIsHbsNu9YKt4VqZiX+5Yz38QmGR/QsQuWLaXP57140jkTb0v65EJ7pgZQegyf po5hJH+GFZP5vfm0FN+Co5jt2GGD0XiZUJkILgEWUuhsh3bld698lrKDVOKI4eMS sBRlJj/ngf/NmbhZLKcTeG/qK4823RITWotPmnmBiwa135fNgWNEG/WfkMBZKwUE zl2Sp1HZk5EI7HwTsRJdZX8JxfB4ulZrveIXHuO0LrprbprbRbU/5uo3iggaTF94 FNW8Qjnf4QaMTex/e8jW2wVgjYsGbQTi3N4UiAZp+hfk5vrpQ2KH2mISdiMNyzIU dO1rqwxLLZ6r9GJL0Pal =5ibS -----END PGP SIGNATURE----- --eUqGrSc0O7wKBRnC--