From owner-freebsd-arch@FreeBSD.ORG Fri Dec 28 23:20:32 2012 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id AC6AB1BC for ; Fri, 28 Dec 2012 23:20:32 +0000 (UTC) (envelope-from jroberson@jroberson.net) Received: from mail-da0-f49.google.com (mail-da0-f49.google.com [209.85.210.49]) by mx1.freebsd.org (Postfix) with ESMTP id 734358FC0A for ; Fri, 28 Dec 2012 23:20:32 +0000 (UTC) Received: by mail-da0-f49.google.com with SMTP id v40so4951145dad.8 for ; Fri, 28 Dec 2012 15:20:26 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20120113; h=x-received:date:from:x-x-sender:to:cc:subject:in-reply-to :message-id:references:user-agent:mime-version:content-type :x-gm-message-state; bh=/QmLskEBRBC7P2BysEeE8VV7PCsh+xjZehibUXU24Lo=; b=St8nCPW6x7556/uM+DFoVPZz4b4KB87ePGvhDP6dhf66VBGBPYRf2L8lozgqGqyRdu eZyAiYuznnmaJT1YKQzsEHQohWP2KW+D5tQvysSywS3mYt3fvRb1m3oN3p037AuAGpyU gCo3y1NySPgtkv2KaFzvAJQWJBu4Gx9PZxexO5WrhARFW7WzM83JwzzzmLhJ45jy1jhx WNsp18GoquJu83bCFzISLgieP7UDi7Fgt7iqVHk/2tZGWJlYRapGW7/215rK18SwYX9c jbuzdsnEwDG+44gKKZeozdLCrPClxdO+beFIChH/m3fuRPGtYOrkFekcLYp8a9ElLAlz 1R9w== X-Received: by 10.68.241.232 with SMTP id wl8mr107604582pbc.144.1356736826133; Fri, 28 Dec 2012 15:20:26 -0800 (PST) Received: from rrcs-66-91-135-210.west.biz.rr.com (rrcs-66-91-135-210.west.biz.rr.com. [66.91.135.210]) by mx.google.com with ESMTPS id gq10sm20381191pbc.54.2012.12.28.15.20.24 (version=SSLv3 cipher=OTHER); Fri, 28 Dec 2012 15:20:25 -0800 (PST) Date: Fri, 28 Dec 2012 13:20:09 -1000 (HST) From: Jeff Roberson X-X-Sender: jroberson@desktop To: Konstantin Belousov Subject: Re: Unmapped I/O In-Reply-To: <20121228230041.GY82219@kib.kiev.ua> Message-ID: References: <20121219135451.GU71906@kib.kiev.ua> <20121228230041.GY82219@kib.kiev.ua> User-Agent: Alpine 2.00 (BSF 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Gm-Message-State: ALoCoQmBpWQISqGg8Hl6mKplx5TSvnaPIQHg2RoxPPdZdgGoOVxQWA8Dhfc8ff+lcharJMUEEq70 Cc: arch@freebsd.org X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 28 Dec 2012 23:20:32 -0000 On Sat, 29 Dec 2012, Konstantin Belousov wrote: > On Fri, Dec 28, 2012 at 12:36:57PM -1000, Jeff Roberson wrote: >> On Wed, 19 Dec 2012, Konstantin Belousov wrote: >> >>> One of the known FreeBSD I/O path performance bootleneck is the >>> neccessity to map each I/O buffer pages into KVA. The problem is that >>> on the multi-core machines, the mapping must flush TLB on all cores, >>> due to the global mapping of the buffer pages into the kernel. This >>> means that buffer creation and destruction disrupts execution of all >>> other cores to perform TLB shootdown through IPI, and the thread >>> initiating the shootdown must wait for all other cores to execute and >>> report. >>> >>> The patch at >>> http://people.freebsd.org/~kib/misc/unmapped.4.patch >>> implements the 'unmapped buffers'. It means an ability to create the >>> VMIO struct buf, which does not point to the KVA mapping the buffer >>> pages to the kernel addresses. Since there is no mapping, kernel does >>> not need to clear TLB. The unmapped buffers are marked with the new >>> B_NOTMAPPED flag, and should be requested explicitely using the >>> GB_NOTMAPPED flag to the buffer allocation routines. If the mapped >>> buffer is requested but unmapped buffer already exists, the buffer >>> subsystem automatically maps the pages. >>> >>> The clustering code is also made aware of the not-mapped buffers, but >>> this required the KPI change that accounts for the diff in the non-UFS >>> filesystems. >>> >>> UFS is adopted to request not mapped buffers when kernel does not need >>> to access the content, i.e. mostly for the file data. New helper >>> function vn_io_fault_pgmove() operates on the unmapped array of pages. >>> It calls new pmap method pmap_copy_pages() to do the data move to and >>> from usermode. >>> >>> Besides not mapped buffers, not mapped BIOs are introduced, marked >>> with the flag BIO_NOTMAPPED. Unmapped buffers are directly translated >>> to unmapped BIOs. Geom providers may indicate an acceptance of the >>> unmapped BIOs. If provider does not handle unmapped i/o requests, >>> geom now automatically establishes transient mapping for the i/o >>> pages. >>> >>> Swap- and malloc-backed md(4) is changed to accept unmapped BIOs. The >>> gpart providers indicate the unmapped BIOs support if the underlying >>> provider can do unmapped i/o. I also hacked ahci(4) to handle >>> unmapped i/o, but this should be changed after the Jeff' physbio patch >>> is committed, to use proper busdma interface. >>> >>> Besides, the swap pager does unmapped swapping if the swap partition >>> indicated that it can do unmapped i/o. By Jeff request, a buffer >>> allocation code may reserve the KVA for unmapped buffer in advance. >>> The unmapped page-in for the vnode pager is also implemented if >>> filesystem supports it, but the page out is not. The page-out, as well >>> as the vnode-backed md(4), currently require mappings, mostly due to >>> the use of VOP_WRITE(). >>> >>> As such, the patch worked in my test environment, where I used >>> ahci-attached SATA disks with gpt partitions, md(4) and UFS. I see no >>> statistically significant difference in the buildworld -j 10 times on >>> the 4-core machine with HT. On the other hand, when doing sha1 over >>> the 5GB file, the system time was reduced by 30%. >>> >>> Unfinished items: >>> - Integration with the physbio, will be done after physbio is >>> committed to HEAD. >>> - The key per-architecture function needed for the unmapped i/o is the >>> pmap_copy_pages(). I implemented it for amd64 and i386 right now, it >>> shall be done for all other architectures. >>> - The sizing of the submap used for transient mapping of the BIOs is >>> naive. Should be adjusted, esp. for KVA-lean architectures. >>> - Conversion of the other filesystems. Low priority. >>> >>> I am interested in reviews, tests and suggestions. Note that this >>> only works now for md(4) and ahci(4), for other drivers the patched >>> kernel should fall back to the mapped i/o. >>> >>> sys/amd64/amd64/pmap.c | 24 +++ >>> sys/cam/ata/ata_da.c | 5 +- >>> sys/cam/cam_ccb.h | 30 ++++ >>> sys/dev/ahci/ahci.c | 53 +++++- >>> sys/dev/md/md.c | 255 ++++++++++++++++++++++++----- >>> sys/fs/cd9660/cd9660_vnops.c | 2 +- >>> sys/fs/ext2fs/ext2_balloc.c | 2 +- >>> sys/fs/ext2fs/ext2_vnops.c | 9 +- >>> sys/fs/msdosfs/msdosfs_vnops.c | 4 +- >>> sys/fs/udf/udf_vnops.c | 5 +- >>> sys/geom/geom.h | 1 + >>> sys/geom/geom_disk.c | 2 + >>> sys/geom/geom_disk.h | 1 + >>> sys/geom/geom_io.c | 44 ++++- >>> sys/geom/geom_vfs.c | 10 +- >>> sys/geom/part/g_part.c | 1 + >>> sys/i386/i386/pmap.c | 42 +++++ >>> sys/kern/vfs_bio.c | 356 +++++++++++++++++++++++++++++++++-------- >>> sys/kern/vfs_cluster.c | 118 +++++++------- >>> sys/kern/vfs_vnops.c | 39 +++++ >>> sys/sys/bio.h | 7 + >>> sys/sys/buf.h | 22 ++- >>> sys/sys/mount.h | 1 + >>> sys/sys/vnode.h | 2 + >>> sys/ufs/ffs/ffs_alloc.c | 10 +- >>> sys/ufs/ffs/ffs_balloc.c | 58 ++++--- >>> sys/ufs/ffs/ffs_vfsops.c | 3 +- >>> sys/ufs/ffs/ffs_vnops.c | 35 ++-- >>> sys/ufs/ufs/ufs_extern.h | 1 + >>> sys/vm/pmap.h | 2 + >>> sys/vm/swap_pager.c | 43 +++-- >>> sys/vm/swap_pager.h | 1 + >>> sys/vm/vm.h | 2 + >>> sys/vm/vm_init.c | 6 +- >>> sys/vm/vm_kern.c | 9 +- >>> sys/vm/vnode_pager.c | 30 +++- >>> 36 files changed, 989 insertions(+), 246 deletions(-) >>> >>> >> >> A few comments: >> >> 1) If the BIO is mapped at all you have to pass the virtual address to >> busdma. We can not lose the fact that it is mapped somewhere or virtual >> cache architectures will fail. I think when we integrate with physbio we >> can do this properly by implementing a load routine on the bio. > This is weird. How the buffer cache works for such architectures then ? > VMIO buffer pages could be mapped by kernel and any usermode. pmap_remove_write in vfs_busy_pages() handles this. No new writes are allowed until the IO is complete. > > Anyway, I believe I never pass BIO_NOTMAPPED down if the kernel mapping > for the buffer exists. I might document it somewhere. Ok, we just need to be sure at any place where we make the transition from unmapped to mapped we use the right pointer. > > Would you prefer to have both mapping address and vm_page array passed > down, if KVA is available ? Yes, I think that's best. >> >> 2) Why does mdstart_swap() need a cpu_flush_dcache? Shouldn't sf bufs on >> the appropriate architecture do the right flushing when they are unmapped? > I believe it was added by Marcel. sfbuf code cannot not know if the > [d]cache should be flushed, since there is no indication from the caller > on the kind of access to the mapping, ro or rw. Maybe we should change the free function to indicate. This seems sloppy. not your fault though. > >> >> 3) I find the NOTMAPPED negative flag awkward grammatically. UNMAPPED >> seems more natural. Or a positive MAPPED flag would be better. Minor >> concern, bikeshedding, etc. > So you prefers BIO/B_UNMAPPED ? I will change this. I think I'd prefer the positive flag BIO_MAPPED but I recognize that's a lot of work to switch the sense in your entire patch now so feel free to ignore me. > >> >> 4) It would be better to have some wrapper functions around the bio >> transient map and or sf buf handling. I will need it to map unmapped cam >> ccbs in device drivers. We need to come to some agreement on this API. >> There should be a fast page-by-page version and a potentially blocking >> all-at-once linear version. > No, I do not think this is needed. I thought about this, and decided > to handle it centralized in g_down instead to require drivers code > to be changed there. > > The driver should indicate the acceptance of the unmapped BIOs. If it > does not accept them, the patch already contains a code in the g_down > path to create the transient mapping. The PIO-mode ATA drivers should > not set the flag on the disk. There might be some need to be able to > dynamically clear the flag when the channel mode is switched to PIO. > > In fact, I do not see why not to require the always mapped BIOs for > legacy ATA. See DISKFLAG_NOTMAPPED_BIO. I guess this would be acceptable for a first cut but I don't think it will fly long-term. I audited all of the cam and block devices recently. There are cases that are difficult to predict ahead of time which would mean some devices would always be operating in a slower mode. Also my primary test machines are 16core boxes that still use legacy ATA. It is not that uncommon. It would be a shame if many of our users saw no benefit from this. Although for ATA PIO we could make an iterator that used the directmap or sf_bufs. That may be sufficient. It would be nice if it was encapsulated in a simple api so every user didn't have to know the details of pinning, sf bufs, and the memory layout. What would you think of that as the in-between compromise? > >> >> 5) Should the transient bio space not come from the buffer_map? We don't >> have more KVA on 32bit architectures. Or did that come from pbufs? We >> could consolidate here some but there are a lot of potential deadlock >> issues. > No, exactly because it is impossible to request the buffer_map KVA > defragmentation from the g_down. The worst which could happen with the > current code is the pauses during intensive i/o when transient map is > exhausted or fragmented. The g_down is paused until enough in-flight > transactions are completed. > How is there room for this new transient map? Can't pbufs completely give up their space to it now? Shouldn't you always allocate BKVASIZE to avoid fragmentation? And speed-up the search. > Peter has tested it under the load where a lot of the parallel i/o were > issued and transient map was constantly exhausted, for the ad(4) driver > requiring fallback mappings. I added the counters to watch the situation. > >> >> 6) Thank you for adding the KVAALLOC flag. Isilon needs this internally. > Please note that I did not tested this yet :/. > >> >> 7) All of that code that exploded into getblk() should be refactored into >> some support functions. It's hard to read and getblk() is too big >> already. > Agreed. > >> >> All in all it looks like we have the right pieces coming together. Just >> needs a little refactoring and then a lot of test. I'm almost ready to >> commit the first phase of physbio. It doesn't yet have the code necessary >> to temporarily map io for things like ATA PIO. I'm hoping that you'll > It is there in the central place. > >> provide the functions to do that. It does abstract out most of the >> details of the memory formats so we can change them at all. And adds >> support to busdma for physical addresses so bounce pages, alignment, >> sizes, and boundaries work. >> >> Thanks, >> Jeff > Jeff