Date:      Wed, 23 Oct 2002 17:43:50 -0400 (EDT)
From:      Jeff Roberson <jroberson@chesapeake.net>
To:        arch@freebsd.org
Subject:   KVAless IO and buffer cache changes
Message-ID:  <20021023165651.W22147-100000@mail.chesapeake.net>

For some time now I have been discussing various ways that we might
address some of our many buffer cache and VM related IO issues.  I have
been collecting problems from my experiences with FreeBSD in IO intensive,
large memory environments.  Various people have given me feedback over
time, and based on this I have a design for fixing up our buffer cache.
I'd like to bring my proposal to a wider audience now so I can avoid
duplicating effort with other people and so that I can get feedback from
more than the few developers that I have contacted directly.

First, I'd like to cover some of the problems that I currently see.

1)  VFS, VM, filesystems and the buffer cache have no clean layering.  The
dependencies, duplication of efforts, and call stacks can give anyone a
headache.

2)  The buf cache has some notion of related pages for IO purposes.  This
information is not shared with the VM, so at pageout and fault time it has
an ad-hoc method for doing clustered IO.  This leads to different IO
strategies depending on whether you're mmaping or doing normal file IO.
It can also lead to conflicting efforts; i.e., a poorly placed fault can
lead to short clustering from vfs_cluster.c.

3)  In practice, bogus_page is used quite a bit more than you would
expect.  Again, this is because the VM does not know about related pages.
When we free one page that constitutes a buf we should throw away its
neighbors as well.  On many systems you end up with lots of bufs that are
partially freed and require extra IO and bogus_page replacement.  If all
pages were thrown away simultaneously when possible, you would end up with
the same number of cached pages but a greater number of fully valid bufs.

4)  All IO in the system requires KVA, even when the kernel is not
accessing the resulting memory through virtual addresses.  This bogus
requirement leads us to have things like pbufs.  It also means our buffer
cache must take up a big chunk of KVA.  These two resources are deadlock
prone and fragmentation prone.  Also, mapping and unmapping these things
can be very expensive.  Right now we map bufs in vfs_bio just so they can
be unmapped again in bus dma.  A rough sketch of this round trip follows
the list.

5)  msync and fsync are separate operations that really could be merged.
There is a duplication of effort all over the vm and buf interfaces
because we have two systems that are trying to achieve similar effects.
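
To illustrate 4), here is roughly what the pageout path does today just to
satisfy the interface.  This is a simplified sketch, not the literal code,
and error handling is omitted:

/*
 * Simplified illustration only: the pbuf exists just to provide KVA,
 * and bus dma immediately turns that KVA back into the physical
 * addresses we already had.  Assumes the usual vm and bus dma headers.
 */
static void
kva_round_trip(vm_page_t *pages, int npages, bus_dma_tag_t tag,
    bus_dmamap_t map, bus_dmamap_callback_t *done)
{
	struct buf *bp;

	bp = getpbuf(NULL);		/* limited pool with KVA attached */
	pmap_qenter((vm_offset_t)bp->b_data, pages, npages);
	/* The driver maps the KVA straight back to physical segments. */
	bus_dmamap_load(tag, map, bp->b_data, npages * PAGE_SIZE,
	    done, bp, 0);
	/* ... the IO runs and completes ... */
	pmap_qremove((vm_offset_t)bp->b_data, npages);
	relpbuf(bp, NULL);
}

With a physically addressed bio none of the mapping above would be
necessary.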


As you can see, there are very real performance penalties with the current
arrangement.  In addition, this code is overly complicated because it is
not well abstracted.  vfs_bio is trying to do too many things.

I'm proposing the following:

A vectorized, address-space-tagged bio with queue points for delayed IO.
This will allow us to do IO to physical addresses.  It would make bio look
more like uio, but instead of iovecs we might want to try bus dma
segments.  In addition to this, the delayed write layer should be a small
module that deals only with bios.  This would require making bio a real
object to support the many users.  Once this is done pbufs could go away
entirely.  The VM could do IO entirely through its getpages/putpages
paths, directly to physical memory.
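
To make that concrete, here is the sort of structure I have in mind.  All
of the names below are invented for illustration and nothing is final; the
point is only that the data is described by physical (bus dma style)
segments instead of a single KVA pointer, and that the delayed write queue
linkage lives in the bio itself:

/*
 * Sketch only -- not a patch.  Assumes <sys/queue.h> plus the bus dma
 * and vm headers.
 */
struct vbio {
	int			vb_cmd;		/* BIO_READ, BIO_WRITE, ... */
	off_t			vb_offset;	/* byte offset of the transfer */
	long			vb_resid;	/* bytes remaining */
	bus_dma_segment_t	*vb_segs;	/* physical addr/len pairs */
	int			vb_nsegs;
	struct vm_object	*vb_object;	/* address space tag, may be NULL */
	vm_pindex_t		vb_pindex;	/* first page within that object */
	TAILQ_ENTRY(vbio)	vb_delay;	/* queue point for delayed writes */
	void			(*vb_done)(struct vbio *);
	void			*vb_private;	/* consumer scratch */
};

The delayed write module would only ever see these and their queue points;
it would not need to know anything about bufs or vnodes.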

The relationship of the pages that comprise a block should be pushed into
the vnode pager.  This layer would take a request for an individual page
and translate that into the group of pages that IO could be done on.  This
way the clustering for vm_fault and vm_pageout would be explicit and on
reasonable boundaries.  It would also give you an opportunity to do read
ahead and clustering from a centralized place.  All IO to vnodes should go
through this layer.
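
As a sketch of the translation (invented names again), it can be as simple
as rounding the requested page out to its block and clipping at the end of
the object; pages_per_block would come from the filesystem's block size:

/*
 * Sketch only: turn a single page request into the run of pages that
 * make up its block.
 */
static void
vnpager_block_range(struct vm_object *obj, vm_pindex_t pindex,
    u_int pages_per_block, vm_pindex_t *first, vm_pindex_t *count)
{
	*first = pindex - (pindex % pages_per_block);
	*count = pages_per_block;
	if (*first + *count > obj->size)
		*count = obj->size - *first;
}

vm_fault and vm_pageout would then always be handed block aligned runs,
and read ahead decisions could be made in this one place instead of being
scattered across vfs_cluster.c, vm_fault and vm_pageout.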

Bufs would be reduced to KVA containers and dependency trackers for
filesystem metadata.  They could be attached to a vnode pager block,
anonymous memory, or malloced data just as they are now.  For the latter
two they would have an embedded bio that would use the delayed write
interface.  The vnode pager would have its own bio for managing delayed
writes.  This change would have minimal impact on the filesystems; they
currently use bread et al. when dealing with metadata, so they are already
telling the buf system when they actually want to look at the contents of
a buf.  The filesystem read/write routines could then be modified to
bypass this layer and talk directly to the vnode pager.
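
Very roughly, what is left of a buf would look something like this (again,
invented names, just to show the shape of it):

/*
 * Sketch only: a KVA container plus dependency tracking for metadata.
 * The IO itself is carried by the embedded bio from the earlier sketch.
 */
struct kvabuf {
	caddr_t			kb_kva;		/* mapping for bread()-style access */
	int			kb_bcount;	/* size of that mapping */
	struct vbio		kb_bio;		/* the actual IO, delayed or not */
	LIST_HEAD(, worklist)	kb_dep;		/* soft updates et al hang off here */
	int			kb_flags;	/* B_DELWRI-style state */
};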

I have many pages of design notes.  If there is enough interest I could
write up a formal design proposal.  Any feedback is welcome.

Cheers,
Jeff

