Date: Wed, 23 Oct 2002 17:43:50 -0400 (EDT)
From: Jeff Roberson <jroberson@chesapeake.net>
To: arch@freebsd.org
Subject: KVAless IO and buffer cache changes
Message-ID: <20021023165651.W22147-100000@mail.chesapeake.net>
For some time now I have been discussing various ways that we might address some of our many buffer cache and VM related IO issues. I have been collecting problems from my experiences with FreeBSD in IO intensive, large memory environments. Various people have given me feedback over time, and based on this I have a design for fixing up our buffer cache. I'd like to bring my proposal to a wider audience now so that I can avoid duplicating effort with other people and so that I may get feedback from more than the few developers I have contacted directly.

First, I'd like to cover some of the problems that I currently see.

1) VFS, VM, filesystems, and the buffer cache have no clean layering. The dependencies, duplication of effort, and call stacks can give anyone a headache.

2) The buf cache has some notion of related pages for IO purposes. This information is not shared with the VM, so at pageout and fault time it has an ad-hoc method for doing clustered IO. This leads to different IO strategies depending on whether you're mmapping or doing normal file IO. It can also lead to conflicting efforts; i.e., a poorly placed fault can lead to short clustering from vfs_cluster.c.

3) In practice, bogus_page is used quite a bit more than you would expect. Again, this is because the VM does not know about related pages. When we free one page that constitutes a buf we should throw away its neighbors as well. In many systems you end up with lots of bufs that are partially free and require extra IO and bogus_page replacement. If all pages were thrown away simultaneously when possible, you would end up with the same number of cached pages but a greater number of fully valid bufs.

4) All IO in the system requires KVA even when the kernel is not accessing the resulting memory through virtual addresses. This bogus requirement leads us to have things like pbufs. It also means our buffer cache must take up a big chunk of KVA. These two resources are deadlock prone and fragmentation prone. Also, mapping and unmapping these things can be very expensive. Right now we map bufs in vfs_bio just so they can be unmapped again in bus dma.

5) msync and fsync are separate operations that really could be merged. There is a duplication of effort all over the vm and buf interfaces because we have two systems that are trying to achieve similar effects.

As you can see, there are very real performance penalties with the current arrangement. In addition, this code is overly complicated because it is not well abstracted. vfs_bio is trying to do too many things.

I'm proposing the following:

A vectorized, address space tagged bio with queue points for delayed IO. This will allow us to do IO to physical addresses. It would make bio look more like uio, but instead of iovecs we might want to try bus dma segments. In addition to this, the delayed write layer should be a small module that deals only with bios. This would require making bio a real object to support the many users. Once this is done pbufs could go away entirely. The VM could do IO entirely through get/put pages directly to physical memory.
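To make that concrete, here is a rough sketch of the kind of bio I have in mind. None of these names exist today; "xbio" and its fields are purely illustrative, though the segment type is the real bus_dma_segment_t:

#include <sys/types.h>
#include <sys/queue.h>
#include <machine/bus.h>        /* bus_dma_segment_t */

/*
 * Illustrative only.  A vectorized bio: instead of a single KVA
 * pointer (b_data), a scatter/gather list of bus dma segments, plus
 * a tag saying which address space the segments live in.  The queue
 * entry lets a small delayed write module keep dirty xbios on its
 * own lists without needing pbufs or the buf queues.
 */
enum xbio_space {
        XBIO_PHYS,              /* segments hold physical addresses */
        XBIO_KVA                /* segments hold kernel virtual addresses */
};

struct xbio {
        TAILQ_ENTRY(xbio) xb_link;      /* delayed write queue point */
        int             xb_cmd;         /* read/write/flush */
        enum xbio_space xb_space;       /* address space tag */
        off_t           xb_offset;      /* byte offset of the transfer */
        int             xb_nsegs;       /* entries in xb_segs[] */
        bus_dma_segment_t *xb_segs;     /* the vector, like a uio's iovecs */
        void            (*xb_done)(struct xbio *); /* completion callback */
};

Since the driver ultimately wants bus dma segments anyway, handing them over directly would avoid the map-in-vfs_bio, unmap-in-bus-dma round trip described in problem 4.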
The relationship of pages that comprise a block should be pushed into the vnode pager. This layer would take a request for an individual page and translate that into a group of pages that IO could be done on. This way the clustering for vm fault and vm pageout would be explicit and on reasonable boundaries. It would also give you an opportunity to do read ahead and clustering from a centralized place.
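Roughly, the translation might look like this. The function name and the fixed geometry are invented for illustration; a real version would take the block size from the filesystem and allocate any pages that are not resident:

#include <sys/param.h>
#include <vm/vm.h>
#include <vm/vm_object.h>
#include <vm/vm_page.h>

/*
 * Illustrative only: take a fault or pageout request for a single
 * page and widen it to the full filesystem block that contains it,
 * so that IO is always done on block boundaries.
 */
static int
vnode_pager_cluster(vm_object_t object, vm_pindex_t pindex,
    vm_page_t *ma, int *count)
{
        vm_pindex_t first;
        int i, pages_per_block;

        /* e.g. 16k filesystem blocks with 4k pages. */
        pages_per_block = 4;
        first = pindex - (pindex % pages_per_block);

        /*
         * Gather every resident page of the block; a real version
         * would grab the missing ones so that a single bio can
         * cover the whole block.
         */
        for (i = 0; i < pages_per_block; i++)
                ma[i] = vm_page_lookup(object, first + i);
        *count = pages_per_block;
        return (0);
}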
All IO to vnodes should go through this layer. Bufs would be reduced to KVA containers and dependency trackers for filesystem metadata. They could be attached to a vnode pager block, anonymous memory, or malloced data just as they are now. For the latter two they would have an embedded bio that would use the delayed write interface. The vnode pager would have its own bio for managing delayed writes.

This change would have minimal impact on the filesystems. They currently use bread() et al. when dealing with metadata, so they are already telling the buf system when they actually want to look at the contents of a buf. The filesystem read/write routines could then be modified to bypass this layer and talk directly to the vnode pager.

I have many pages of design notes. If there is enough interest I could write up a formal design proposal. Any feedback is welcome.

Cheers,
Jeff