Date:      Wed, 19 Dec 2012 15:17:35 -0800
From:      Adrian Chadd <adrian@freebsd.org>
To:        Alan Cox <alc@rice.edu>
Cc:        alc@freebsd.org, Konstantin Belousov <kostikbel@gmail.com>, arch@freebsd.org
Subject:   Re: Unmapped I/O
Message-ID:  <CAJ-VmokhnEEuY7OjetyfwdQj0K6B9-dFiO3s6UsEnu6tbB37hw@mail.gmail.com>
In-Reply-To: <50D22EA6.1040501@rice.edu>
References:  <20121219135451.GU71906@kib.kiev.ua> <CAJUyCcNuD_TWR6xxFxVqDi4-eBGx3Jjs21eBxaZYYVUERESbMw@mail.gmail.com> <alpine.BSF.2.00.1212190923170.2005@desktop> <50D22EA6.1040501@rice.edu>

... some of us are trying to get FreeBSD ready to run on your ARM phone.

Please don't break that end-goal. :-)


Adrian

On 19 December 2012 13:16, Alan Cox <alc@rice.edu> wrote:
> On 12/19/2012 13:28, Jeff Roberson wrote:
>> On Wed, 19 Dec 2012, Alan Cox wrote:
>>
>>> On Wed, Dec 19, 2012 at 7:54 AM, Konstantin Belousov
>>> <kostikbel@gmail.com>wrote:
>>>
>>>> One of the known FreeBSD I/O path performance bottlenecks is the
>>>> necessity to map each I/O buffer's pages into KVA.  The problem is
>>>> that on multi-core machines, the mapping must flush the TLB on all
>>>> cores, due to the global mapping of the buffer pages into the
>>>> kernel.  This means that buffer creation and destruction disrupt the
>>>> execution of all other cores to perform TLB shootdowns through IPIs,
>>>> and the thread initiating the shootdown must wait for all other
>>>> cores to execute and report back.
>>>>
>>>> The patch at
>>>> http://people.freebsd.org/~kib/misc/unmapped.4.patch
>>>> implements 'unmapped buffers': the ability to create a VMIO struct
>>>> buf that does not carry a KVA mapping of the buffer pages into
>>>> kernel addresses.  Since there is no mapping, the kernel does not
>>>> need to flush the TLB.  Unmapped buffers are marked with the new
>>>> B_NOTMAPPED flag and must be requested explicitly by passing the
>>>> GB_NOTMAPPED flag to the buffer allocation routines.  If a mapped
>>>> buffer is requested but an unmapped one already exists, the buffer
>>>> subsystem maps the pages automatically.
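>>>>
>>>> A minimal sketch of the intended calling pattern (hypothetical
>>>> caller; the flag names are the ones the patch introduces):
>>>>
>>>>     struct buf *bp;
>>>>
>>>>     /* Ask for a buffer without a KVA mapping.  bp->b_data is then
>>>>      * not usable; the data is reachable only via bp->b_pages[]. */
>>>>     bp = getblk(vp, lbn, size, 0, 0, GB_NOTMAPPED);
>>>>     if ((bp->b_flags & B_NOTMAPPED) != 0) {
>>>>             /* operate on bp->b_pages[] directly */
>>>>     }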
>>>>
>>>> The clustering code is also made aware of unmapped buffers, but this
>>>> required a KPI change, which accounts for the parts of the diff that
>>>> touch non-UFS filesystems.
>>>>
>>>> UFS is adapted to request unmapped buffers when the kernel does not
>>>> need to access the content, i.e. mostly for file data.  A new helper
>>>> function, vn_io_fault_pgmove(), operates on the unmapped array of
>>>> pages.  It calls the new pmap method pmap_copy_pages() to move the
>>>> data to and from usermode.
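>>>>
>>>> On a direct-map architecture, pmap_copy_pages() needs no transient
>>>> mappings at all.  A sketch of what the amd64 version could look like
>>>> (the authoritative code is in the patch):
>>>>
>>>>     void
>>>>     pmap_copy_pages(vm_page_t ma[], vm_offset_t a_offset,
>>>>         vm_page_t mb[], vm_offset_t b_offset, int xfersize)
>>>>     {
>>>>             void *a_cp, *b_cp;
>>>>             vm_offset_t a_pg_offset, b_pg_offset;
>>>>             int cnt;
>>>>
>>>>             while (xfersize > 0) {
>>>>                     /* Clamp each copy so it never crosses a page
>>>>                      * boundary in either the source or the
>>>>                      * destination array. */
>>>>                     a_pg_offset = a_offset & PAGE_MASK;
>>>>                     cnt = min(xfersize, PAGE_SIZE - a_pg_offset);
>>>>                     a_cp = (char *)PHYS_TO_DMAP(VM_PAGE_TO_PHYS(
>>>>                         ma[a_offset >> PAGE_SHIFT])) + a_pg_offset;
>>>>                     b_pg_offset = b_offset & PAGE_MASK;
>>>>                     cnt = min(cnt, PAGE_SIZE - b_pg_offset);
>>>>                     b_cp = (char *)PHYS_TO_DMAP(VM_PAGE_TO_PHYS(
>>>>                         mb[b_offset >> PAGE_SHIFT])) + b_pg_offset;
>>>>                     bcopy(a_cp, b_cp, cnt);
>>>>                     a_offset += cnt;
>>>>                     b_offset += cnt;
>>>>                     xfersize -= cnt;
>>>>             }
>>>>     }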
>>>>
>>>> Besides unmapped buffers, unmapped BIOs are introduced, marked with
>>>> the BIO_NOTMAPPED flag.  Unmapped buffers translate directly into
>>>> unmapped BIOs.  GEOM providers may indicate that they accept
>>>> unmapped BIOs.  If a provider does not handle unmapped I/O requests,
>>>> geom now automatically establishes a transient mapping for the I/O
>>>> pages.
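>>>>
>>>> For a consumer, the check is roughly the following (the page-array
>>>> field names here are illustrative assumptions, not necessarily what
>>>> the patch uses):
>>>>
>>>>     if ((bp->bio_flags & BIO_NOTMAPPED) != 0) {
>>>>             /* No usable bio_data pointer; walk the page array
>>>>              * instead.  Hypothetical fields: bp->bio_ma[],
>>>>              * bp->bio_ma_n. */
>>>>     } else {
>>>>             /* Classic path: bp->bio_data is a mapped buffer. */
>>>>     }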
>>>>
>>>> Swap- and malloc-backed md(4) are changed to accept unmapped BIOs.
>>>> The gpart providers indicate support for unmapped BIOs if the
>>>> underlying provider can do unmapped I/O.  I also hacked ahci(4) to
>>>> handle unmapped I/O, but this should be changed to use the proper
>>>> busdma interface after Jeff's physbio patch is committed.
>>>>
>>>> In addition, the swap pager does unmapped swapping if the swap
>>>> partition indicates that it can do unmapped I/O.  At Jeff's request,
>>>> the buffer allocation code may reserve KVA for an unmapped buffer in
>>>> advance.  Unmapped page-in for the vnode pager is also implemented
>>>> if the filesystem supports it, but page-out is not.  Page-out, as
>>>> well as vnode-backed md(4), currently requires mappings, mostly due
>>>> to the use of VOP_WRITE().
>>>>
>>>> The patch works in my test environment, where I used ahci-attached
>>>> SATA disks with GPT partitions, md(4) and UFS.  I see no
>>>> statistically significant difference in buildworld -j 10 times on a
>>>> 4-core machine with HT.  On the other hand, when running sha1 over a
>>>> 5GB file, system time was reduced by 30%.
>>>>
>>>> Unfinished items:
>>>> - Integration with physbio; will be done after physbio is committed
>>>>   to HEAD.
>>>> - The key per-architecture function needed for unmapped I/O is
>>>>   pmap_copy_pages().  I have implemented it for amd64 and i386; it
>>>>   still needs to be done for all other architectures.
>>>> - The sizing of the submap used for transient mappings of BIOs is
>>>>   naive.  It should be adjusted, especially for KVA-lean
>>>>   architectures.
>>>> - Conversion of the other filesystems.  Low priority.
>>>>
>>>> I am interested in reviews, tests and suggestions.  Note that
>>>> unmapped I/O currently works only for md(4) and ahci(4); for other
>>>> drivers the patched kernel falls back to mapped I/O.
>>>>
>>>>
>>> Here are a couple things for you to think about:
>>>
>>> 1. A while back, I developed the patch at
>>> http://www.cs.rice.edu/~alc/buf_maps5.patch as an experiment in
>>> trying to reduce the number of TLB shootdowns by the buffer map.
>>> The idea is simple: Replace the calls to pmap_q{enter,remove}() with
>>> calls to a new machine-dependent function that opportunistically
>>> sets the buffer's kernel virtual address to the direct map for
>>> physically contiguous pages.  However, if the pages are not
>>> physically contiguous, it calls pmap_qenter() with the kernel
>>> virtual address from the buffer map.
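>>>
>>> Concretely, the fast path amounts to something like this (a sketch;
>>> pages_are_contiguous() is an invented helper, see the patch for the
>>> real function):
>>>
>>>     if (pages_are_contiguous(bp->b_pages, bp->b_npages)) {
>>>             /* Physically contiguous: borrow the direct map, so no
>>>              * PTEs are installed and nothing needs shooting down. */
>>>             bp->b_data = (caddr_t)PHYS_TO_DMAP(
>>>                 VM_PAGE_TO_PHYS(bp->b_pages[0]));
>>>     } else {
>>>             /* Fall back to the buffer map as before. */
>>>             pmap_qenter((vm_offset_t)bp->b_kvabase, bp->b_pages,
>>>                 bp->b_npages);
>>>             bp->b_data = bp->b_kvabase;
>>>     }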
>>>
>>> This eliminated about half of the TLB shootdowns for a buildworld,
>>> because there is a decent amount of physical contiguity that occurs
>>> by "accident".  Using a buddy allocator for physical page allocation
>>> tends to promote this contiguity.  However, in a few places, it
>>> occurs by explicit action, e.g., mapped files, including large
>>> executables, using superpage reservations.
>>>
>>> So, how does this fit with what you've done?  You might think of
>>> using what I describe above as a kind of "fast path".  As you can
>>> see from the patch, it's very simple and non-intrusive.  If the
>>> pages aren't physically contiguous, then instead of using
>>> pmap_qenter(), you fall back to whatever approach for creating
>>> ephemeral mappings is appropriate to a given architecture.
>>
>> I think these are complementary.  Kib's patch gives us the fastest
>> possible path for user data.  Alan's patch will improve metadata
>> performance for things that really require the buffer cache.  I see
>> no reason not to clean up and commit both.
>>
>>>
>>> 2. As for managing the ephemeral mappings on machines that don't
>>> support a direct map: I would suggest an approach that is loosely
>>> inspired by copying garbage collection (or the segment cleaners in
>>> log-structured file systems).  Roughly, you manage the buffer map as
>>> a few spaces (or segments).  When you create a new mapping in one of
>>> these spaces, you simply install the PTEs.  When you decide to
>>> "garbage collect" a space (or spaces), you perform a global TLB
>>> flush.  Specifically, you do something like toggling the bit in the
>>> cr4 register that enables/disables support for the PG_G bit.  If the
>>> spaces are sufficiently large, then the number of such global TLB
>>> flushes should be quite low.  Every space would have an epoch number
>>> (or flush number).  In the buffer, you would record the epoch number
>>> alongside the kernel virtual address.  On access to the buffer, if
>>> the epoch number is too old, then you have to recreate the buffer's
>>> mapping in a new space.
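>>>
>>> To make the access-time check concrete (all names here are invented
>>> for illustration, nothing below exists in any patch):
>>>
>>>     /* If the buffer's mapping predates the last global flush of its
>>>      * space, the PTEs and TLB entries can no longer be trusted. */
>>>     if (bp->b_kva_epoch != space_epoch(bp->b_kva_space)) {
>>>             bp->b_kvabase = space_alloc_kva(cur_space,
>>>                 bp->b_kvasize);
>>>             pmap_qenter((vm_offset_t)bp->b_kvabase, bp->b_pages,
>>>                 bp->b_npages);
>>>             bp->b_kva_space = cur_space;
>>>             bp->b_kva_epoch = space_epoch(cur_space);
>>>     }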
>>
>> Are the machines that don't have a direct map performance-critical?
>> My expectation is that they are legacy or embedded.  This seems like
>> a great project to do once the rest of the pieces are stable and
>> fast.  Until then, they could just use something like pbufs?
>>
>
>
> I think the answer to your first question depends entirely on who you
> are.  :-)  Also, at the low end of the server space, there are many
> people trying to promote ARM-based systems.  While FreeBSD may never
> run on your ARM-based phone, I think that ceding the ARM-based server
> market to others would be a strategic mistake.
>
> Alan
>
> P.S. I think we're moving the discussion too far away from kib's
> original topic, so I suggest changing the subject line on any
> follow-ups.
>


