From: Adrian Chadd
To: Alan Cox
Cc: alc@freebsd.org, Konstantin Belousov, arch@freebsd.org
Date: Wed, 19 Dec 2012 15:17:35 -0800
Subject: Re: Unmapped I/O
In-Reply-To: <50D22EA6.1040501@rice.edu>
References: <20121219135451.GU71906@kib.kiev.ua> <50D22EA6.1040501@rice.edu>
List-Id: Discussion related to FreeBSD architecture

... some of us are trying to get FreeBSD ready to run on your ARM phone.
Please don't break that end-goal. :-)

Adrian

On 19 December 2012 13:16, Alan Cox wrote:
> On 12/19/2012 13:28, Jeff Roberson wrote:
>> On Wed, 19 Dec 2012, Alan Cox wrote:
>>
>>> On Wed, Dec 19, 2012 at 7:54 AM, Konstantin Belousov wrote:
>>>
>>>> One of the known FreeBSD I/O path performance bottlenecks is the
>>>> necessity to map each I/O buffer's pages into KVA. The problem is that
>>>> on multi-core machines, the mapping must flush the TLB on all cores,
>>>> due to the global mapping of the buffer pages into the kernel. This
>>>> means that buffer creation and destruction disrupts execution of all
>>>> other cores to perform TLB shootdown through IPI, and the thread
>>>> initiating the shootdown must wait for all other cores to execute and
>>>> report.
>>>>
>>>> The patch at
>>>> http://people.freebsd.org/~kib/misc/unmapped.4.patch
>>>> implements 'unmapped buffers'. It means the ability to create a
>>>> VMIO struct buf which does not point to a KVA mapping of the buffer
>>>> pages to kernel addresses. Since there is no mapping, the kernel does
>>>> not need to clear the TLB. Unmapped buffers are marked with the new
>>>> B_NOTMAPPED flag, and should be requested explicitly using the
>>>> GB_NOTMAPPED flag to the buffer allocation routines. If a mapped
>>>> buffer is requested but an unmapped buffer already exists, the buffer
>>>> subsystem automatically maps the pages.
>>>>
>>>> The clustering code is also made aware of the not-mapped buffers, but
>>>> this required a KPI change that accounts for the diff in the non-UFS
>>>> filesystems.
>>>>
>>>> UFS is adapted to request not-mapped buffers when the kernel does not
>>>> need to access the content, i.e. mostly for the file data. The new helper
>>>> function vn_io_fault_pgmove() operates on the unmapped array of pages.
>>>> It calls the new pmap method pmap_copy_pages() to do the data move to
>>>> and from usermode.
>>>>
>>>> Besides not-mapped buffers, not-mapped BIOs are introduced, marked
>>>> with the flag BIO_NOTMAPPED. Unmapped buffers are directly translated
>>>> to unmapped BIOs. Geom providers may indicate acceptance of
>>>> unmapped BIOs. If a provider does not handle unmapped i/o requests,
>>>> geom now automatically establishes a transient mapping for the i/o
>>>> pages.
>>>>
>>>> Swap- and malloc-backed md(4) is changed to accept unmapped BIOs. The
>>>> gpart providers indicate unmapped BIO support if the underlying
>>>> provider can do unmapped i/o. I also hacked ahci(4) to handle
>>>> unmapped i/o, but this should be changed after Jeff's physbio patch
>>>> is committed, to use the proper busdma interface.
>>>>
>>>> Besides, the swap pager does unmapped swapping if the swap partition
>>>> indicated that it can do unmapped i/o. At Jeff's request, the buffer
>>>> allocation code may reserve the KVA for an unmapped buffer in advance.
>>>> The unmapped page-in for the vnode pager is also implemented if the
>>>> filesystem supports it, but the page-out is not. The page-out, as well
>>>> as the vnode-backed md(4), currently require mappings, mostly due to
>>>> the use of VOP_WRITE().
>>>>
>>>> As such, the patch worked in my test environment, where I used
>>>> ahci-attached SATA disks with gpt partitions, md(4) and UFS. I see no
>>>> statistically significant difference in the buildworld -j 10 times on
>>>> the 4-core machine with HT. On the other hand, when doing sha1 over
>>>> the 5GB file, the system time was reduced by 30%.
>>>>
>>>> Unfinished items:
>>>> - Integration with physbio, will be done after physbio is
>>>>   committed to HEAD.
>>>> - The key per-architecture function needed for unmapped i/o is
>>>>   pmap_copy_pages(). I implemented it for amd64 and i386 right now; it
>>>>   shall be done for all other architectures.
>>>> - The sizing of the submap used for transient mapping of the BIOs is
>>>>   naive. Should be adjusted, esp. for KVA-lean architectures.
>>>> - Conversion of the other filesystems. Low priority.
>>>>
>>>> I am interested in reviews, tests and suggestions. Note that this
>>>> only works now for md(4) and ahci(4); for other drivers the patched
>>>> kernel should fall back to the mapped i/o.
>>>>
>>>>
>>> Here are a couple of things for you to think about:
>>>
>>> 1. A while back, I developed the patch at
>>> http://www.cs.rice.edu/~alc/buf_maps5.patch as an experiment in trying to
>>> reduce the number of TLB shootdowns by the buffer map. The idea is simple:
>>> Replace the calls to pmap_q{enter,remove}() with calls to a new
>>> machine-dependent function that opportunistically sets the buffer's kernel
>>> virtual address to the direct map for physically contiguous pages.
>>> However, if the pages are not physically contiguous, it calls pmap_qenter()
>>> with the kernel virtual address from the buffer map.
>>>
>>> This eliminated about half of the TLB shootdowns for a buildworld, because
>>> there is a decent amount of physical contiguity that occurs by "accident".
>>> Using a buddy allocator for physical page allocation tends to promote this
>>> contiguity. However, in a few places, it occurs by explicit action, e.g.,
>>> mapped files, including large executables, using superpage reservations.
>>>
>>> So, how does this fit with what you've done? You might think of using what
>>> I describe above as a kind of "fast path". As you can see from the patch,
>>> it's very simple and non-intrusive. If the pages aren't physically
>>> contiguous, then instead of using pmap_qenter(), you fall back to whatever
>>> approach for creating ephemeral mappings is appropriate to a given
>>> architecture.
>>
>> I think these are complementary. Kib's patch gives us the fastest
>> possible path for user data.
>> Alan's patch will improve the metadata
>> performance for things that really require the buffer cache. I see no
>> reason not to clean up and commit both.
>>
>>>
>>> 2. As for managing the ephemeral mappings on machines that don't support a
>>> direct map, I would suggest an approach that is loosely inspired by
>>> copying garbage collection (or the segment cleaners in log-structured file
>>> systems). Roughly, you manage the buffer map as a few spaces (or
>>> segments). When you create a new mapping in one of these spaces (or
>>> segments), you simply install the PTEs. When you decide to "garbage
>>> collect" a space (or spaces), then you perform a global TLB flush.
>>> Specifically, you do something like toggling the bit in the cr4 register
>>> that enables/disables support for the PG_G bit. If the spaces are
>>> sufficiently large, then the number of such global TLB flushes should be
>>> quite low. Every space would have an epoch number (or flush number). In
>>> the buffer, you would record the epoch number alongside the kernel virtual
>>> address. On access to the buffer, if the epoch number was too old, then
>>> you have to recreate the buffer's mapping in a new space.
>>
>> Are the machines that don't have a direct map performance critical?
>> My expectation is that they are legacy or embedded. This seems like a
>> great project to do when the rest of the pieces are stable and fast.
>> Until then they could just use something like pbufs?
>>
>
>
> I think the answer to your first question depends entirely on who you
> are. :-) Also, at the low-end of the server space, there are many
> people trying to promote arm-based systems. While FreeBSD may never run
> on your arm-based phone, I think that ceding the arm-based server market
> to others will be a strategic mistake.
>
> Alan
>
> P.S. I think we're moving the discussion too far away from kib's
> original, so I suggest changing the subject line on any follow-ups.
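
The epoch scheme in Alan's item 2 above (buffer-map spaces, each tagged with
an epoch number) can be illustrated with a small userland simulation. This is
only a sketch under stated assumptions, not code from either patch: every
name here (struct space, struct bufmap, buf_map(), buf_valid(), buf_access(),
NSPACES, SPACE_SLOTS) is hypothetical, a (space, slot) pair stands in for the
kernel virtual address, and the global TLB flush is reduced to a counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NSPACES     2   /* hypothetical: number of spaces in the buffer map */
#define SPACE_SLOTS 4   /* hypothetical: mappings that fit in one space */

struct space {
	uint64_t epoch;   /* bumped each time the space is garbage collected */
	size_t   next;    /* next free slot; SPACE_SLOTS means the space is full */
};

struct bufmap {
	struct space spaces[NSPACES];
	size_t       cur;      /* space currently receiving new mappings */
	unsigned     flushes;  /* simulated global TLB flushes so far */
};

struct buf {
	bool     mapped;
	size_t   space, slot;  /* stands in for the kernel virtual address */
	uint64_t epoch;        /* epoch of the space when the mapping was made */
};

static void
bufmap_init(struct bufmap *bm)
{
	for (size_t i = 0; i < NSPACES; i++) {
		bm->spaces[i].epoch = 0;
		bm->spaces[i].next = 0;
	}
	bm->cur = 0;
	bm->flushes = 0;
}

/* Garbage collect one space: every mapping in it dies with a single flush. */
static void
space_collect(struct bufmap *bm, struct space *sp)
{
	sp->epoch++;
	sp->next = 0;
	bm->flushes++;   /* a real pmap would do the global TLB flush here */
}

/* Create a mapping: no flush unless a used space has to be recycled. */
static void
buf_map(struct bufmap *bm, struct buf *bp)
{
	struct space *sp = &bm->spaces[bm->cur];

	if (sp->next == SPACE_SLOTS) {
		/* Current space is full: move on, recycling the next one if used. */
		bm->cur = (bm->cur + 1) % NSPACES;
		sp = &bm->spaces[bm->cur];
		if (sp->next != 0)
			space_collect(bm, sp);
	}
	assert(sp->next < SPACE_SLOTS);
	bp->space = bm->cur;
	bp->slot = sp->next++;   /* stands in for "install the PTEs" */
	bp->epoch = sp->epoch;
	bp->mapped = true;
}

/* A mapping is usable only while its space has not been collected since. */
static bool
buf_valid(const struct bufmap *bm, const struct buf *bp)
{
	return (bp->mapped && bm->spaces[bp->space].epoch == bp->epoch);
}

/* Access path: returns true if a stale mapping had to be recreated. */
static bool
buf_access(struct bufmap *bm, struct buf *bp)
{
	if (buf_valid(bm, bp))
		return (false);
	buf_map(bm, bp);
	return (true);
}
```

With these sizes, mapping nine buffers in a row fills both spaces and recycles
the first one, so a single flush is counted for nine mappings. A buffer that
was mapped in the recycled space then fails buf_valid() and is remapped on its
next buf_access(), which is the "epoch number was too old" case.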