From owner-freebsd-arch@FreeBSD.ORG  Wed Dec 19 23:17:37 2012
MIME-Version: 1.0
Received: by 10.194.93.40 with SMTP id cr8mr14455368wjb.16.1355959055670; Wed,
 19 Dec 2012 15:17:35 -0800 (PST)
Sender: adrian.chadd@gmail.com
Received: by 10.217.57.9 with HTTP; Wed, 19 Dec 2012 15:17:35 -0800 (PST)
In-Reply-To: <50D22EA6.1040501@rice.edu>
References: <20121219135451.GU71906@kib.kiev.ua>
 <CAJUyCcNuD_TWR6xxFxVqDi4-eBGx3Jjs21eBxaZYYVUERESbMw@mail.gmail.com>
 <alpine.BSF.2.00.1212190923170.2005@desktop>
 <50D22EA6.1040501@rice.edu>
Date: Wed, 19 Dec 2012 15:17:35 -0800
X-Google-Sender-Auth: 3qIopfcW2Mi5NBNl1fg7OkpLduI
Message-ID: <CAJ-VmokhnEEuY7OjetyfwdQj0K6B9-dFiO3s6UsEnu6tbB37hw@mail.gmail.com>
Subject: Re: Unmapped I/O
From: Adrian Chadd <adrian@freebsd.org>
To: Alan Cox <alc@rice.edu>
Content-Type: text/plain; charset=ISO-8859-1
Cc: alc@freebsd.org, Konstantin Belousov <kostikbel@gmail.com>,
 arch@freebsd.org

... some of us are trying to get FreeBSD ready to run on your ARM phone.

Please don't break that end goal. :-)


Adrian

On 19 December 2012 13:16, Alan Cox <alc@rice.edu> wrote:
> On 12/19/2012 13:28, Jeff Roberson wrote:
>> On Wed, 19 Dec 2012, Alan Cox wrote:
>>
>>> On Wed, Dec 19, 2012 at 7:54 AM, Konstantin Belousov
>>> <kostikbel@gmail.com> wrote:
>>>
>>>> One of the known FreeBSD I/O path performance bottlenecks is the
>>>> necessity to map each I/O buffer's pages into KVA.  The problem is that
>>>> on the multi-core machines, the mapping must flush TLB on all cores,
>>>> due to the global mapping of the buffer pages into the kernel.  This
>>>> means that buffer creation and destruction disrupts execution of all
>>>> other cores to perform TLB shootdown through IPI, and the thread
>>>> initiating the shootdown must wait for all other cores to execute and
>>>> report.
>>>>
>>>> The patch at
>>>> http://people.freebsd.org/~kib/misc/unmapped.4.patch
>>>> implements the 'unmapped buffers'.  It means an ability to create the
>>>> VMIO struct buf, which does not point to the KVA mapping the buffer
>>>> pages to the kernel addresses.  Since there is no mapping, kernel does
>>>> not need to clear TLB. The unmapped buffers are marked with the new
>>>> B_NOTMAPPED flag, and should be requested explicitly using the
>>>> GB_NOTMAPPED flag to the buffer allocation routines.  If the mapped
>>>> buffer is requested but unmapped buffer already exists, the buffer
>>>> subsystem automatically maps the pages.
>>>>
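
To make the flag semantics above concrete, here is a minimal user-space
sketch in C.  It is not code from the patch: struct sketch_buf,
sketch_map() and sketch_getblk() are hypothetical stand-ins, and the flag
values are made up.  Only the names B_NOTMAPPED and GB_NOTMAPPED come from
the description.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Flag names follow the patch description; the values are invented. */
#define B_NOTMAPPED  0x1   /* buffer currently has no KVA mapping */
#define GB_NOTMAPPED 0x1   /* caller accepts an unmapped buffer */

/* Hypothetical, stripped-down stand-in for struct buf. */
struct sketch_buf {
    bool  valid;       /* buffer already exists */
    int   b_flags;
    void *b_data;      /* NULL while the buffer is unmapped */
    int   map_count;   /* number of times a mapping was established */
};

static char fake_kva[4096];   /* stands in for buffer-map KVA */

/* Hypothetical helper: establish a kernel mapping for the buffer. */
void sketch_map(struct sketch_buf *bp)
{
    bp->b_data = fake_kva;
    bp->b_flags &= ~B_NOTMAPPED;
    bp->map_count++;
}

/* Look up or create the buffer; map it only if the caller needs that. */
struct sketch_buf *sketch_getblk(struct sketch_buf *bp, int gbflags)
{
    assert(bp != NULL);
    if (!bp->valid) {
        /* New buffers start out unmapped in this sketch. */
        bp->valid = true;
        bp->b_flags = B_NOTMAPPED;
        bp->b_data = NULL;
        bp->map_count = 0;
    }
    /* Mapped buffer requested but an unmapped one exists: map it now. */
    if ((gbflags & GB_NOTMAPPED) == 0 && (bp->b_flags & B_NOTMAPPED) != 0)
        sketch_map(bp);
    /* Assumption: an already mapped buffer is returned as is. */
    return bp;
}
```

A caller that never touches the data passes GB_NOTMAPPED and pays for no
mapping; the first caller that passes 0 triggers exactly one sketch_map().
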
>>>> The clustering code is also made aware of the not-mapped buffers, but
>>>> this required the KPI change that accounts for the diff in the non-UFS
>>>> filesystems.
>>>>
>>>> UFS is adapted to request not mapped buffers when kernel does not need
>>>> to access the content, i.e. mostly for the file data.  New helper
>>>> function vn_io_fault_pgmove() operates on the unmapped array of pages.
>>>> It calls new pmap method pmap_copy_pages() to do the data move to and
>>>> from usermode.
>>>>
>>>> Besides not mapped buffers, not mapped BIOs are introduced, marked
>>>> with the flag BIO_NOTMAPPED.  Unmapped buffers are directly translated
>>>> to unmapped BIOs.  Geom providers may indicate an acceptance of the
>>>> unmapped BIOs.  If provider does not handle unmapped i/o requests,
>>>> geom now automatically establishes transient mapping for the i/o
>>>> pages.
>>>>
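
The GEOM fallback described above could look roughly like the sketch
below.  Again this is an illustration, not the patch: struct
sketch_provider, struct sketch_bio, sketch_io_request() and the
transient_window array are hypothetical, and only the BIO_NOTMAPPED name
is taken from the description (its value here is invented).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define BIO_NOTMAPPED 0x1   /* name from the description, value invented */

/* Hypothetical provider: records whether it accepts unmapped BIOs. */
struct sketch_provider {
    bool accepts_unmapped;
};

/* Hypothetical, stripped-down stand-in for struct bio. */
struct sketch_bio {
    int   bio_flags;
    void *bio_data;    /* NULL while the BIO is unmapped */
    bool  transient;   /* true if a transient mapping was created */
};

static char transient_window[4096];   /* stands in for the transient submap */

/* Hand a BIO to a provider, mapping it first if the provider needs that. */
void sketch_io_request(const struct sketch_provider *pp, struct sketch_bio *bp)
{
    assert(pp != NULL && bp != NULL);
    if ((bp->bio_flags & BIO_NOTMAPPED) != 0 && !pp->accepts_unmapped) {
        /* Provider cannot handle unmapped i/o: create a transient mapping. */
        bp->bio_data = transient_window;
        bp->bio_flags &= ~BIO_NOTMAPPED;
        bp->transient = true;
    }
    /* Otherwise the BIO is passed down unchanged. */
}
```

The point of the design is that drivers opt in: anything that has not
been converted keeps seeing mapped BIOs and needs no changes.
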
>>>> Swap- and malloc-backed md(4) is changed to accept unmapped BIOs. The
>>>> gpart providers indicate the unmapped BIOs support if the underlying
>>>> provider can do unmapped i/o.  I also hacked ahci(4) to handle
>>>> unmapped i/o, but this should be changed after Jeff's physbio patch
>>>> is committed, to use proper busdma interface.
>>>>
>>>> Besides, the swap pager does unmapped swapping if the swap partition
>>>> indicated that it can do unmapped i/o.  At Jeff's request, the buffer
>>>> allocation code may reserve the KVA for unmapped buffer in advance.
>>>> The unmapped page-in for the vnode pager is also implemented if
>>>> filesystem supports it, but the page out is not. The page-out, as well
>>>> as the vnode-backed md(4), currently require mappings, mostly due to
>>>> the use of VOP_WRITE().
>>>>
>>>> As such, the patch worked in my test environment, where I used
>>>> ahci-attached SATA disks with gpt partitions, md(4) and UFS.  I see no
>>>> statistically significant difference in the buildworld -j 10 times on
>>>> the 4-core machine with HT.  On the other hand, when doing sha1 over
>>>> the 5GB file, the system time was reduced by 30%.
>>>>
>>>> Unfinished items:
>>>> - Integration with the physbio, will be done after physbio is
>>>>   committed to HEAD.
>>>> - The key per-architecture function needed for the unmapped i/o is the
>>>>   pmap_copy_pages(). I implemented it for amd64 and i386 right now, it
>>>>   shall be done for all other architectures.
>>>> - The sizing of the submap used for transient mapping of the BIOs is
>>>>   naive.  Should be adjusted, esp. for KVA-lean architectures.
>>>> - Conversion of the other filesystems. Low priority.
>>>>
>>>> I am interested in reviews, tests and suggestions.  Note that this
>>>> only works now for md(4) and ahci(4), for other drivers the patched
>>>> kernel should fall back to the mapped i/o.
>>>>
>>>>
>>> Here are a couple things for you to think about:
>>>
>>> 1. A while back, I developed the patch at
>>> http://www.cs.rice.edu/~alc/buf_maps5.patch as an experiment in trying
>>> to reduce the number of TLB shootdowns by the buffer map.  The idea is
>>> simple: Replace the calls to pmap_q{enter,remove}() with calls to a new
>>> machine-dependent function that opportunistically sets the buffer's
>>> kernel virtual address to the direct map for physically contiguous
>>> pages.  However, if the pages are not physically contiguous, it calls
>>> pmap_qenter() with the kernel virtual address from the buffer map.
>>>
>>> This eliminated about half of the TLB shootdowns for a buildworld,
>>> because there is a decent amount of physical contiguity that occurs by
>>> "accident".  Using a buddy allocator for physical page allocation tends
>>> to promote this contiguity.  However, in a few places, it occurs by
>>> explicit action, e.g., mapped files, including large executables, using
>>> superpage reservations.
>>>
>>> So, how does this fit with what you've done?  You might think of using
>>> what I describe above as a kind of "fast path".  As you can see from
>>> the patch, it's very simple and non-intrusive.  If the pages aren't
>>> physically contiguous, then instead of using pmap_qenter(), you fall
>>> back to whatever approach for creating ephemeral mappings is
>>> appropriate to a given architecture.
>>
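
A rough sketch of the contiguity "fast path" from point 1, in plain C.
This is not Alan's patch: the function names, the 4 KB page size and the
direct map base address are assumptions made for illustration, and the
real check would live in machine-dependent pmap code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u
/* Hypothetical direct map base; the real value is machine-dependent. */
#define SKETCH_DMAP_BASE UINT64_C(0xfffff80000000000)

/* True if every page directly follows its predecessor in physical memory. */
bool sketch_pages_contiguous(const uint64_t *pa, size_t npages)
{
    for (size_t i = 1; i < npages; i++) {
        if (pa[i] != pa[i - 1] + SKETCH_PAGE_SIZE)
            return false;
    }
    return true;
}

/*
 * Pick the kernel virtual address for a buffer backed by the physical
 * pages pa[0..npages-1].  On the fast path the direct map address is
 * returned and *needs_qenter is false, so no PTEs change and no TLB
 * shootdown is needed.  Otherwise the caller must still enter the pages
 * at buffer_map_kva (the pmap_qenter() case in the text).
 */
uint64_t sketch_buf_kva(const uint64_t *pa, size_t npages,
    uint64_t buffer_map_kva, bool *needs_qenter)
{
    assert(pa != NULL && npages > 0 && needs_qenter != NULL);
    if (sketch_pages_contiguous(pa, npages)) {
        *needs_qenter = false;
        return SKETCH_DMAP_BASE + pa[0];
    }
    *needs_qenter = true;
    return buffer_map_kva;
}
```

Note that a single-page buffer is trivially contiguous, so it always
takes the fast path in this sketch.
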
>> I think these are complementary.  Kib's patch gives us the fastest
>> possible path for user data.  Alan's patch will improve the metadata
>> performance for things that really require the buffer cache.  I see no
>> reason not to clean up and commit both.
>>
>>>
>>> 2. As for managing the ephemeral mappings on machines that don't
>>> support a direct map, I would suggest an approach that is loosely
>>> inspired by copying garbage collection (or the segment cleaners in
>>> log-structured file systems).  Roughly, you manage the buffer map as a
>>> few spaces (or segments).  When you create a new mapping in one of
>>> these spaces (or segments), you simply install the PTEs.  When you
>>> decide to "garbage collect" a space (or spaces), then you perform a
>>> global TLB flush.  Specifically, you do something like toggling the bit
>>> in the cr4 register that enables/disables support for the PG_G bit.  If
>>> the spaces are sufficiently large, then the number of such global TLB
>>> flushes should be quite low.  Every space would have an epoch number
>>> (or flush number).  In the buffer, you would record the epoch number
>>> alongside the kernel virtual address.  On access to the buffer, if the
>>> epoch number was too old, then you have to recreate the buffer's
>>> mapping in a new space.
>>
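
The epoch scheme in point 2 can be modelled in a few lines of user-space
C.  Everything here is hypothetical (the struct and function names, two
spaces of four slots each); the global TLB flush is only counted, not
performed.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_NSPACES 2   /* number of spaces (segments), chosen arbitrarily */
#define SKETCH_SLOTS   4   /* mappings per space, tiny for illustration */

struct sketch_space {
    unsigned epoch;       /* bumped each time the space is recycled */
    unsigned next_slot;   /* next free slot in this space */
};

struct sketch_bufmap {
    struct sketch_space spaces[SKETCH_NSPACES];
    unsigned cur;              /* space currently being filled */
    unsigned global_flushes;   /* stands in for real global TLB flushes */
};

/* What a buffer would record: where it is mapped and in which epoch. */
struct sketch_mapping {
    unsigned space, slot, epoch;
};

/* Create a mapping.  Normally this only "installs PTEs" (takes a slot). */
struct sketch_mapping sketch_map_create(struct sketch_bufmap *bm)
{
    struct sketch_space *sp = &bm->spaces[bm->cur];
    if (sp->next_slot == SKETCH_SLOTS) {
        /*
         * Current space is full: recycle the next one.  One global flush
         * invalidates every old mapping in it at once.  For simplicity
         * the sketch also counts a flush the first time a space is used.
         */
        bm->cur = (bm->cur + 1) % SKETCH_NSPACES;
        sp = &bm->spaces[bm->cur];
        sp->epoch++;
        sp->next_slot = 0;
        bm->global_flushes++;
    }
    struct sketch_mapping m = { bm->cur, sp->next_slot++, sp->epoch };
    return m;
}

/* A mapping is usable only if its space has not been recycled since. */
bool sketch_map_valid(const struct sketch_bufmap *bm,
    const struct sketch_mapping *m)
{
    return bm->spaces[m->space].epoch == m->epoch;
}

/* On access, recreate the mapping if its epoch is too old. */
void sketch_map_access(struct sketch_bufmap *bm, struct sketch_mapping *m)
{
    if (!sketch_map_valid(bm, m))
        *m = sketch_map_create(bm);
}
```

With realistic space sizes the flush counter grows once per many
thousands of mappings, rather than once per buffer as with a shootdown
on every unmap.
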
>> Are the machines that don't have a direct map performance critical?
>> My expectation is that they are legacy or embedded.  This seems like a
>> great project to do when the rest of the pieces are stable and fast.
>> Until then they could just use something like pbufs?
>>
>
>
> I think the answer to your first question depends entirely on who you
> are.  :-)  Also, at the low-end of the server space, there are many
> people trying to promote arm-based systems.  While FreeBSD may never run
> on your arm-based phone, I think that ceding the arm-based server market
> to others will be a strategic mistake.
>
> Alan
>
> P.S. I think we're moving the discussion too far away from kib's
> original, so I suggest changing the subject line on any follow ups.
>