Date:      Mon, 30 Nov 2009 14:27:23 -0500
From:      John Baldwin <jhb@freebsd.org>
To:        Scott Long <scottl@samsco.org>
Cc:        Attilio Rao <attilio@freebsd.org>, arch@freebsd.org
Subject:   Re: sglist(9)
Message-ID:  <200911301427.23166.jhb@freebsd.org>
In-Reply-To: <66707B0F-D0AB-49DB-802F-13146F488E1A@samsco.org>
References:  <200905191458.50764.jhb@freebsd.org> <3bbf2fe10911291429k54b4b7cfw9e40aefeca597307@mail.gmail.com> <66707B0F-D0AB-49DB-802F-13146F488E1A@samsco.org>

On Sunday 29 November 2009 6:27:49 pm Scott Long wrote:
> > On Wednesday 20 May 2009 2:49:30 pm Jeff Roberson wrote:
> >> On Tue, 19 May 2009, John Baldwin wrote:
> >>
> >> 2) I worry that if all users do sglist_count() followed by a dynamic
> >> allocation and then an _append(), that sequence will be very expensive.
> >> pmap_kextract() is much more expensive than it may first seem to be.
> >> Do you have a user of count already?
> >
> > The only one that does now is sglist_build() and nothing currently  
> > uses that.
> 
> Kinda silly to have it then.  I also don't see the point of it; if the  
> point of the sglist object is to avoid VA mappings, then why start  
> with a VA mapping?  But that aside, Jeff is correct, sglist_build() is  
> horribly inefficient.

It actually does get used by the Nvidia driver.  Beyond that, I have tried
several different approaches in my jhb_bio branch so far, which is why the
API is as verbose as it is.

> > VOP_GET/PUTPAGES would not need to do this since they could simply
> > append the physical addresses extracted directly from vm_page_t's for
> > example.  I'm not sure this will be used very much now as originally I
> > thought I would be changing all storage drivers to do all DMA
> > operations using sglists and this sort of thing would have been used
> > for non-bio requests like firmware commands; however, as expounded on
> > below, it actually appears better to still treat bio's separate from
> > non-bio requests for bus_dma so that the non-bio requests can continue
> > to use bus_dmamap_load_buffer() as they do now.
> >
> 
> I completely disagree with your approach to busdma, but I'll get into  
> that later.  What I really don't understand is, why have yet another  
> page description format?  Whether the I/O is disk buffers being pushed  
> by a pager to a disk controller, or texture buffers being pushed to  
> video hardware, they already have vm objects associated with them,  
> no?  Why translate to an intermediate format?  I understand that  
> you're creating another vm object format to deal directly with the  
> needs of nvidia, but that's really a one-off case right now.  What  
> about when we want the pagedaemon to push unmapped i/o?  Will it have  
> to spend cycles translating its objects to sglists?  This is really a  
> larger question that I'm not prepared to answer, but would like to  
> discuss.

Textures do not already have objects associated, no.  However, I did not
design sglist with Nvidia in mind.  I hacked on it in conjunction with
phk@, gibbs@, and jeff@ to work on unmapped bio support.  That was the
only motivation for sglist(9).  Only the OBJT_SG bits were specific to
Nvidia and that was added as an afterthought because sglist(9) already
existed in the jhb_bio branch.

If you look at GETPAGES/PUTPAGES, they already deal in terms of vm_page_t's,
not VM objects, and a vm_page_t already stores its physical address directly
(m->phys_addr), so generating an sglist for GETPAGES and PUTPAGES will be
very cheap.  One of the original proposals from phk@ was actually to pass
around arrays of vm_page_t's to describe I/O buffers.  When Poul, Peter, and
I talked about it, we figured we had a choice between passing either the
physical address or the vm_page_t.  However, not all physical addresses have
vm_page_t's, and it was deemed that GEOMs and disk drivers did not need any
properties of the vm_page_t aside from the physical address.  For those
reasons, sglist(9) uses physical addresses.
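
As a rough illustration of why that path is cheap (a hedged sketch, not code
from the branch: sglist_alloc(), sglist_append_phys(), and VM_PAGE_TO_PHYS()
are the existing interfaces, while the surrounding function is made up):

    /*
     * Hypothetical sketch: build an sglist from the vm_page_t array that
     * VOP_PUTPAGES() is handed.  sglist_alloc(), sglist_append_phys(), and
     * VM_PAGE_TO_PHYS() are the real interfaces; the function itself is
     * made up for illustration.
     */
    #include <sys/param.h>
    #include <sys/malloc.h>
    #include <sys/sglist.h>
    #include <vm/vm.h>
    #include <vm/vm_page.h>

    static struct sglist *
    pages_to_sglist(vm_page_t *ma, int count)
    {
            struct sglist *sg;
            int i;

            sg = sglist_alloc(count, M_WAITOK);
            for (i = 0; i < count; i++)
                    /* m->phys_addr via VM_PAGE_TO_PHYS(); no pmap_kextract(). */
                    (void)sglist_append_phys(sg, VM_PAGE_TO_PHYS(ma[i]),
                        PAGE_SIZE);
            return (sg);
    }

Compare that with sglist_build(), which has to call pmap_kextract() for every
page of a KVA range, which is Jeff's point above.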

> >> In general I think this is a good idea.  It'd be nice to work on
> >> replacing the buf layer's implementation with something like this that
> >> could be used directly by drivers.  Have you considered a busdma
> >> operation to load from a sglist?
> >
> > So in regards to the bus_dma stuff, I did work on this a while ago in my
> > jhb_bio branch.  I do have a bus_dmamap_load_sglist() and I had planned
> > on using that in storage drivers directly.  However, I ended up circling
> > back to preferring a bus_dmamap_load_bio() and adding a new 'bio_start'
> > field to 'struct bio' that is an offset into an attached sglist.
> 
> I strongly disagree with forcing busdma to have intimate knowledge of  
> bio's.  All of the needed information can be stored in sglist headers.

The alternative is to teach every disk driver, as well as every GEOM module,
to handle the difference.  Not only that, but it doesn't provide any easy
transition path, since you can't change any of the top-level code to use an
unmapped bio until all the lower levels have been converted to handle both
forms.  Jeff had originally proposed having a bus_dmamap_load_bio() and I
tried to avoid it by just having a bus_dmamap_load_sglist() instead, but when
I started looking at the extra work that would have to be duplicated in every
driver to handle both types of bios, I concluded bus_dmamap_load_bio() would
actually be a lot simpler.
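
To make that concrete, here is a hedged sketch of the two options as seen
from a driver's I/O start routine.  bus_dmamap_load_bio() and
bus_dmamap_load_sglist() are the jhb_bio proposals; the bio_sglist field and
the exact argument lists are my own placeholders:

    #include <sys/param.h>
    #include <sys/bio.h>
    #include <machine/bus.h>

    static int
    xx_load_request(bus_dma_tag_t dmat, bus_dmamap_t map, struct bio *bp,
        bus_dmamap_callback_t *cb)
    {
    #ifdef ONLY_LOAD_SGLIST
            /*
             * Option A: with only bus_dmamap_load_sglist(), every driver
             * (and every GEOM class that touches data) grows a branch
             * like this.
             */
            if (bp->bio_sglist != NULL)     /* placeholder unmapped check */
                    return (bus_dmamap_load_sglist(dmat, map, bp->bio_sglist,
                        bp->bio_start, bp->bio_length, cb, bp,
                        BUS_DMA_NOWAIT));
            return (bus_dmamap_load(dmat, map, bp->bio_data, bp->bio_length,
                cb, bp, BUS_DMA_NOWAIT));
    #else
            /*
             * Option B: bus_dmamap_load_bio() makes that decision once,
             * inside bus_dma, so the driver stays a single call.
             */
            return (bus_dmamap_load_bio(dmat, map, bp, cb, bp,
                BUS_DMA_NOWAIT));
    #endif
    }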
 
> >  This let me carve up I/O requests in geom_dev to satisfy a disk
> > device's max request size while still sharing the same read-only sglist
> > across the various BIO's (by simply adjusting bio_length and bio_start
> > to be a subrange of the sglist) as opposed to doing memory allocations
> > to allocate specific ranges of an sglist (using something like
> > sglist_slice()) for each I/O request.
> 
> I think this is fundamentally wrong.  You're proposing exchanging a  
> cheap operation of splitting VA's with an expensive operation of  
> allocating, splitting, copying, and refcounting sglists.  Splitting is  
> an excessively common operation, and your proposal will impact  
> performance as storage becomes exponentially faster.

The whole point is to not do anything to the sglist when splitting up
requests, precisely so that it is more efficient.  I wrote above that
splitting up the sglist would require allocations and be slow, so I
specifically avoided that.  Instead, one just does a simple refcount bump
(refcount_acquire()) when cloning the bio (which already allocates the new
bio) and a simple 'bio->bio_start += X' where one already does
'bio->bio_data += X' now.

The sglist describes the "large" buffer at the "top" of the I/O request tree,
and when you split up the large bio into smaller ones you simply use
bio_start and bio_length to specify the sub-range of that buffer.
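
A minimal sketch of such a split, assuming the jhb_bio-style
bio_sglist/bio_start fields (g_clone_bio() and sglist_hold() are the existing
helpers; the rest is illustrative):

    #include <sys/param.h>
    #include <sys/bio.h>
    #include <sys/sglist.h>
    #include <geom/geom.h>

    /*
     * Carve [off, off + len) out of a larger request without touching the
     * parent's sglist: bump its refcount and record the sub-range.
     */
    static struct bio *
    split_bio(struct bio *bp, off_t off, off_t len)
    {
            struct bio *bp2;

            bp2 = g_clone_bio(bp);          /* the only allocation here */
            if (bp2 == NULL)
                    return (NULL);
            bp2->bio_offset = bp->bio_offset + off;
            bp2->bio_length = len;
            if (bp->bio_sglist != NULL) {           /* unmapped request */
                    sglist_hold(bp->bio_sglist);    /* refcount_acquire() */
                    bp2->bio_sglist = bp->bio_sglist;
                    bp2->bio_start = bp->bio_start + off;
            } else                                  /* mapped request */
                    bp2->bio_data = bp->bio_data + off;
            return (bp2);
    }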

> We need to stop thinking about maxio as a roadbump at the bottom of  
> the storage stack, and instead think of it as a fundamental attribute  
> that is honored at the top when a BIO is created.  Instead of loading  
> up an sglist with all of the pages (and don't forget coalesced pages  
> that might need to be broken up), maybe multiple bio's are created  
> that honor maxio from the start, or a single bio with a chained  
> sglist, with each chain link honoring maxio, allowing for easy  
> splitting.

It may be that the splitting geom_dev does happens at the wrong layer; I'm
not debating that. :)  I attempted to make sglist work efficiently with what
is there now, and other things like striping will also want cheap splitting
of buffers.

> >  I then have bus_dmamap_load_bio() use the subrange of the
> > sglist internally or fall back to using the KVA pointer if the sglist
> > isn't present.
> 
> I completely disagree.  Drivers already deal with the details of  
> bio's, and should continue to do so.  If a driver gets a bio that has  
> a valid bio_data pointer, it should call bus_dmamap_load().  If it
> gets one with a valid sglist, it should call bus_dmamap_load_sglist().
> Your proposal means that every storage driver in the system will
> have to change to use bus_dmamap_load_bio().  It's not a big change,  
> but it's disruptive both in the tree and out.  Your proposal also  
> implies that CAM will have to start carrying BIO's in CCBs and passing  
> them to their SIMs.  I absolutely disagree with this.

Ok.  As I mentioned above, while this does add churn, I think it is less
churn than changing all the drivers to handle the two different types of bio
requests.  I also think the alternative is much less friendly to doing the
unmapped I/O changes in stages, which is what allows the work to progress in
parallel in different areas.  I also believe I specifically mentioned
changing CCBs to pass the bio instead of the raw (data, length) pair when I
discussed this with folks earlier.

> If we keep unneeded complications out of busdma, we avoid a lot of  
> churn.  We also leave the busdma interface available for other forms  
> of I/O without requiring more specific API additions to accommodate
> them.  What about unmapped network i/o coming from something like  
> sendfile?

I do have a bus_dmamap_load_sglist() in my tree already.  Do note that we
already have bus_dmamap_load_mbuf() and bus_dmamap_load_uio(), so there is
precedent for letting bus_dma handle slightly more complex data structures
than just a (buffer, length) pair.
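
For comparison, the existing prototypes and the jhb_bio addition (whose
argument list here is my paraphrase, not the branch's exact signature) line
up like this:

    /* Existing variants declared via <machine/bus.h>: */
    int bus_dmamap_load_mbuf(bus_dma_tag_t dmat, bus_dmamap_t map,
            struct mbuf *mbuf, bus_dmamap_callback2_t *callback,
            void *callback_arg, int flags);
    int bus_dmamap_load_uio(bus_dma_tag_t dmat, bus_dmamap_t map,
            struct uio *uio, bus_dmamap_callback2_t *callback,
            void *callback_arg, int flags);

    /* The jhb_bio branch adds the same pattern for sglists (paraphrased): */
    int bus_dmamap_load_sglist(bus_dma_tag_t dmat, bus_dmamap_t map,
            struct sglist *sg, bus_dmamap_callback_t *callback,
            void *callback_arg, int flags);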

> > However, I'm not really trying to get the bio stuff into the tree; this
> > is mostly for the Nvidia case, and for that use case the driver is
> > simply creating simple single-entry lists and using sglist_append_phys().
> >
> 
> Designing the whole API around a single driver that we can't even get  
> the source to makes it hard to evaluate the API.

The API was designed for the bio stuff, and not for any specific driver.  The 
Nvidia stuff was only done as an afterthought because the sglist(9) structure 
already existed at the time.  It was also designed as a result of the 
discussions among several people and not completely in a vacuum.

> Attilio and I have spoken about this in private and will begin work on  
> a prototype.  Here is the outline of what we're going to do:

For those playing along at home, the next steps I suggested to Attilio were
to add the following APIs and then make the necessary changes so that drivers
and GEOM modules use them:

- bus_dmamap_load_bio():  Fairly simple.  Just takes a bio instead of (buffer,
  length).
- bio_adjust():  This is a lot like m_adj(), but for bio's instead of mbuf's.
  It can be inline, but the point is to have GEOM modules use this to split up
  a bio buffer instead of directly manipulating bio_data and bio_length
  (possibly bio_offset as well).  A rough sketch follows this list.
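
As referenced above, a minimal sketch of bio_adjust(), assuming the
jhb_bio-style bio_start field and ignoring error handling (whether bio_offset
belongs here is exactly the open question noted in the item):

    #include <sys/param.h>
    #include <sys/bio.h>

    /*
     * Trim 'len' bytes from the front of a request, like m_adj() with a
     * positive length.  Purely illustrative.
     */
    static __inline void
    bio_adjust(struct bio *bp, off_t len)
    {
            bp->bio_offset += len;          /* possibly; see above */
            bp->bio_length -= len;
            if (bp->bio_data != NULL)
                    bp->bio_data += len;    /* mapped: move the KVA pointer */
            else
                    bp->bio_start += len;   /* unmapped: shift sglist range */
    }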

Once these changes are done, adding support for simple unmapped bio's
consists of adding sglist support and bus_dmamap_load_bio() to bus_dma on
each architecture.  Upper-layer code could then start using unmapped bios
(I had hacky prototype changes to physio).

There would still be several big things to work out, such as GEOM modules
that need to manipulate the data and not just pass it through.  phk's
suggestion here was to have the driver or GEOM module fail the request with a
magic error code.  The originator was then supposed to map the buffer and
retry the request.  Presumably one could note the first time a given device
object failed a request that way and always send down mapped requests
afterwards to avoid delays in subsequent I/O requests.  There are other ways
of handling this problem as well, I imagine.  I have not made any attempt to
solve it.

Also, the changes Jeff has discussed with regards to tearing up
getpages/putpages and the buffer cache in general to take advantage of
unmapped bios are a separate animal that would build on this stuff further.
I have not made any attempt at this either.

I do find the idea of chaining sglists together interesting.  It would lose
one of the "benefits" of the current layout, which is that the segment array
is ABI-compatible with bus_dma's S/G list format, so that in the common case
the sglist that physio or getpages/putpages would generate could be passed
directly to the device driver's bus_dma callback without having to generate
an intermediate data structure.
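
Concretely, the compatibility in question is between these two segment
layouts (field names as in sys/sglist.h and the bus_dma headers; "common
case" because the address and size types only coincide on some
architectures):

    /* sglist(9) segment (sys/sglist.h): */
    struct sglist_seg {
            vm_paddr_t      ss_paddr;
            size_t          ss_len;
    };

    /* bus_dma segment handed to a driver's callback: */
    typedef struct bus_dma_segment {
            bus_addr_t      ds_addr;        /* DMA address */
            bus_size_t      ds_len;         /* length of transfer */
    } bus_dma_segment_t;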

-- 
John Baldwin


