1.  The miseducation of struct buf.

      To fully appreciate the topic, I include a short historical overview of struct buf. It is a most enlightening case, not exactly of bit-rot, but more appropriately of design-rot.

      In the beginning, which for this purpose extends until virtual memory was introduced into UNIX, all disk I/O was done from or to a struct buf. In the 6th edition sources, as printed in Lions' book, struct buf looks like this:

struct buf
{
int b_flags; /* see defines below */
struct buf *b_forw; /* headed by devtab of b_dev */
struct buf *b_back; /* ' */
struct buf *av_forw; /* position on free list, */
struct buf *av_back; /* if not BUSY*/
int b_dev; /* major+minor device name */
int b_wcount; /* transfer count (usu. words) */
char *b_addr; /* low order core address */
char *b_xmem; /* high order core address */
char *b_blkno; /* block # on device */
char b_error; /* returned after I/O */
char *b_resid; /* words not transferred after error */
} buf[NBUF];

      At this point in time, struct buf had only two functions: to act as a cache and to transport I/O operations to device drivers. For the purpose of this document, the cache functionality is uninteresting and will be ignored.

      The I/O operations functionality consists of three parts:

+ Where in RAM/core is the data located (b_addr, b_xmem, b_wcount).

+ Where on disk is the data located (b_dev, b_blkno).

+ Request and result information (b_flags, b_error, b_resid).

      In addition to this, the av_forw and av_back elements are used by the disk device drivers to put requests on a linked list. All in all the majority of struct buf is involved with the I/O aspect and only a few fields relate exclusively to the cache aspect.

      If we step forward to the BSD 4.4-Lite-2 release, struct buf has grown a bit here and there:

struct buf {
LIST_ENTRY(buf) b_hash; /* Hash chain. */
LIST_ENTRY(buf) b_vnbufs; /* Buffer's associated vnode. */
TAILQ_ENTRY(buf) b_freelist; /* Free list position if not active. */
struct buf *b_actf, **b_actb; /* Device driver queue when active. */
struct proc *b_proc; /* Associated proc; NULL if kernel. */
volatile long b_flags; /* B_* flags. */
int b_error; /* Errno value. */
long b_bufsize; /* Allocated buffer size. */
long b_bcount; /* Valid bytes in buffer. */
long b_resid; /* Remaining I/O. */
dev_t b_dev; /* Device associated with buffer. */
struct {
caddr_t b_addr; /* Memory, superblocks, indirect etc. */
} b_un;
void *b_saveaddr; /* Original b_addr for physio. */
daddr_t b_lblkno; /* Logical block number. */
daddr_t b_blkno; /* Underlying physical block number. */
/* Function to call upon completion. */
void (*b_iodone) __P((struct buf *));
struct vnode *b_vp; /* Device vnode. */
long b_pfcent; /* Center page when swapping cluster. */
/* XXX pfcent should be int; overld. */
int b_dirtyoff; /* Offset in buffer of dirty region. */
int b_dirtyend; /* Offset of end of dirty region. */
struct ucred *b_rcred; /* Read credentials reference. */
struct ucred *b_wcred; /* Write credentials reference. */
int b_validoff; /* Offset in buffer of valid region. */
int b_validend; /* Offset of end of valid region. */
};

      The main piece of action is the addition of vnodes, a VM system and a prototype LFS filesystem, all of which needed some handles on struct buf. Comparison will show that the I/O aspect of struct buf is in essence unchanged: the length field is now in bytes instead of words, the linked list the drivers can use has been renamed (b_actf, b_actb), and a b_iodone pointer for callback notification has been added, but otherwise there is no change to the fields which represent the I/O aspect. All the new fields relate to the cache aspect, link buffers to the VM system, provide hacks for file-systems (b_lblkno), and so on.

      By the time we get to FreeBSD 3.0, more stuff has grown on struct buf:

struct buf {
LIST_ENTRY(buf) b_hash; /* Hash chain. */
LIST_ENTRY(buf) b_vnbufs; /* Buffer's associated vnode. */
TAILQ_ENTRY(buf) b_freelist; /* Free list position if not active. */
TAILQ_ENTRY(buf) b_act; /* Device driver queue when active. *new* */
struct proc *b_proc; /* Associated proc; NULL if kernel. */
long b_flags; /* B_* flags. */
unsigned short b_qindex; /* buffer queue index */
unsigned char b_usecount; /* buffer use count */
int b_error; /* Errno value. */
long b_bufsize; /* Allocated buffer size. */
long b_bcount; /* Valid bytes in buffer. */
long b_resid; /* Remaining I/O. */
dev_t b_dev; /* Device associated with buffer. */
caddr_t b_data; /* Memory, superblocks, indirect etc. */
caddr_t b_kvabase; /* base kva for buffer */
int b_kvasize; /* size of kva for buffer */
daddr_t b_lblkno; /* Logical block number. */
daddr_t b_blkno; /* Underlying physical block number. */
off_t b_offset; /* Offset into file */
/* Function to call upon completion. */
void (*b_iodone) __P((struct buf *));
/* For nested b_iodone's. */
struct iodone_chain *b_iodone_chain;
struct vnode *b_vp; /* Device vnode. */
int b_dirtyoff; /* Offset in buffer of dirty region. */
int b_dirtyend; /* Offset of end of dirty region. */
struct ucred *b_rcred; /* Read credentials reference. */
struct ucred *b_wcred; /* Write credentials reference. */
int b_validoff; /* Offset in buffer of valid region. */
int b_validend; /* Offset of end of valid region. */
daddr_t b_pblkno; /* physical block number */
void *b_saveaddr; /* Original b_addr for physio. */
caddr_t b_savekva; /* saved kva for transfer while bouncing */
void *b_driver1; /* for private use by the driver */
void *b_driver2; /* for private use by the driver */
void *b_spc;
union cluster_info {
TAILQ_HEAD(cluster_list_head, buf) cluster_head;
TAILQ_ENTRY(buf) cluster_entry;
} b_cluster;
struct vm_page *b_pages[btoc(MAXPHYS)];
int b_npages;
struct workhead b_dep; /* List of filesystem dependencies. */
};

      Still we find that the I/O aspect of struct buf is in essence unchanged. A couple of fields which allow the driver to hang local data off the buf while working on it have been added (b_driver1, b_driver2), and a "physical block number" (b_pblkno) has been added.

      This b_pblkno is relevant. It has been added because the disklabel/slice code has been abstracted out of the device drivers: the filesystem asks for b_blkno, and the slice/label code translates this into b_pblkno, which the device driver operates on.
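
      The translation step can be sketched roughly as follows. This is a minimal illustration and not the FreeBSD slice/label code: struct req, struct slice, slice_translate() and the 512-byte sector size are hypothetical and introduced only for this example. Only the field names b_blkno, b_pblkno and b_bcount are taken from the structures shown above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SECTOR_SIZE 512             /* assumed sector size for this sketch */

typedef int64_t blkno_t;            /* block number, counted in sectors */

/* Hypothetical, stripped-down request: only the fields this sketch needs. */
struct req {
    blkno_t b_blkno;    /* block number relative to the start of the slice */
    blkno_t b_pblkno;   /* block number relative to the start of the disk */
    long    b_bcount;   /* number of bytes to transfer */
};

/* Hypothetical slice descriptor. */
struct slice {
    blkno_t start;      /* first sector of the slice on the disk */
    blkno_t size;       /* length of the slice in sectors */
};

/*
 * Fill in b_pblkno from b_blkno for a request aimed at slice 'sl'.
 * Returns 0 on success, -1 if the request does not fit inside the slice.
 */
int
slice_translate(const struct slice *sl, struct req *rq)
{
    blkno_t nsect;

    assert(sl != NULL && rq != NULL);
    if (rq->b_blkno < 0 || rq->b_bcount < 0)
        return (-1);
    /* Round the byte count up to whole sectors. */
    nsect = (rq->b_bcount + SECTOR_SIZE - 1) / SECTOR_SIZE;
    if (nsect > sl->size || rq->b_blkno > sl->size - nsect)
        return (-1);
    /* The 1:1 mapping: shift by the slice's offset on the disk. */
    rq->b_pblkno = sl->start + rq->b_blkno;
    return (0);
}
```

      Note that b_blkno is left untouched, so the layer above still sees the block number it asked for, while the driver only looks at b_pblkno.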

      After this point some minor cleanups have happened, some unused fields have been removed, etc., but the I/O aspect of struct buf is still only a fraction of the entire structure: less than a quarter of the bytes in a struct buf are used for the I/O aspect, and struct buf seems to continue to grow and grow.

      Since version 6, as documented in Lions' book, three significant pieces of code have emerged which need to do non-trivial translations of the I/O request before it reaches the device drivers: CCD, slice/label and Vinum. They all basically do the same thing: they map I/O requests from a logical space to a physical space, and the mappings they perform can be 1:1 or 1:N. [note 1]

      The 1:1 mapping of the slice/label code is rather trivial, and the addition of the b_pblkno field catered for the majority of the issues this resulted in, leaving but one: reads or writes to the magic "disklabel" or equally magic "MBR" sectors on a disk must be caught, examined and in some cases modified before being passed on to the device driver. This need resulted in the addition of the b_iodone_chain field, which adds a limited ability to stack I/O operations.

      The 1:N mappings of CCD and Vinum are far more interesting. These two subsystems look like device drivers, but rather than drive some piece of hardware, they allocate new struct buf data structures, populate these and pass them on to other device drivers.
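
      To make the 1:N case concrete, here is a minimal sketch of how one logical request might be cut into per-disk child requests. It assumes a plain striped layout over equally sized disks with a fixed interleave. It is not the CCD or Vinum algorithm, and struct ioreq and stripe_split() are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical child request carrying only the I/O aspect. */
struct ioreq {
    int     dev;        /* index of the component disk */
    int64_t blkno;      /* starting block on that disk */
    long    nblks;      /* length in blocks */
    long    bufoff;     /* offset, in blocks, into the parent's data */
};

/*
 * Split the logical range [blkno, blkno + nblks) of a stripe set made of
 * 'ndisks' disks with an interleave of 'ileave' blocks into child requests.
 * Returns the number of children stored in 'out', or -1 if 'max' is too small.
 */
int
stripe_split(int ndisks, long ileave, int64_t blkno, long nblks,
    struct ioreq *out, int max)
{
    long done = 0;
    int n = 0;

    assert(ndisks > 0 && ileave > 0 && blkno >= 0 && nblks >= 0);
    assert(out != NULL);
    while (done < nblks) {
        int64_t cur = blkno + done;
        int64_t chunk = cur / ileave;           /* global chunk number */
        long within = (long)(cur % ileave);     /* offset inside the chunk */
        long len = ileave - within;             /* blocks left in the chunk */

        if (len > nblks - done)
            len = nblks - done;
        if (n == max)
            return (-1);
        out[n].dev = (int)(chunk % ndisks);     /* chunks rotate over disks */
        out[n].blkno = (chunk / ndisks) * ileave + within;
        out[n].nblks = len;
        out[n].bufoff = done;
        n++;
        done += len;
    }
    return (n);
}
```

      With two disks and an interleave of 4 blocks, a request for 8 blocks starting at logical block 6 becomes three children: 2 blocks on disk 1, 4 blocks on disk 0 and 2 blocks on disk 1 again. Each child needs nothing beyond the handful of fields above, which is the point of the argument that follows.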

      Apart from it being inefficient to lug about a 348-byte data structure when 80 bytes would have done, it also leads to significant code rot when programmers don't know what to do about the remaining fields or, even worse, "borrow" a field or two for their own uses.

     

Conclusions:
+ Struct buf is a victim of chronic bloat.

+ The I/O aspect of struct buf is practically constant and only about ¼ of the total bytes.

+ Struct buf currently has several users (vinum, ccd and, to a limited extent, diskslice/label) which need only the I/O aspect, not the vnode, caching or VM linkage.

The I/O aspect of struct buf should be put in a separate struct bio.
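
      As a rough sketch of what such a structure could contain, the fields below are simply the I/O-related members of struct buf discussed above, gathered in one place. The bio_ prefix, the exact types, the BIO_DONE flag and the helpers bio_complete() and bio_mark_done() are assumptions made for illustration. This is not a definitive layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BIO_DONE 0x1L               /* hypothetical "request finished" flag */

struct bio;
typedef void bio_done_t(struct bio *);

/* Hypothetical struct bio: only the I/O aspect of struct buf. */
struct bio {
    /* Request and result information. */
    long        bio_flags;      /* request flags */
    int         bio_error;      /* errno value, 0 if none */
    long        bio_resid;      /* bytes not transferred */
    /* Where in memory the data is located. */
    char       *bio_data;       /* data buffer */
    long        bio_bcount;     /* bytes to transfer */
    /* Where on disk the data is located. */
    int         bio_dev;        /* stand-in for dev_t */
    int64_t     bio_blkno;      /* block number as requested */
    int64_t     bio_pblkno;     /* physical block number */
    /* Completion and driver support. */
    bio_done_t *bio_done;       /* called upon completion, may be NULL */
    void       *bio_driver1;    /* for private use by the driver */
    void       *bio_driver2;    /* for private use by the driver */
    struct bio *bio_next;       /* driver queue linkage */
};

/* Record the result of a request and notify whoever issued it. */
void
bio_complete(struct bio *bp, int error, long resid)
{
    assert(bp != NULL);
    bp->bio_error = error;
    bp->bio_resid = resid;
    if (bp->bio_done != NULL)
        bp->bio_done(bp);
}

/* Example completion callback: just mark the request as finished. */
void
bio_mark_done(struct bio *bp)
{
    bp->bio_flags |= BIO_DONE;
}
```

      Nothing in this sketch refers to vnodes, the cache or the VM system, so a subsystem like ccd or vinum could allocate and fill in such a structure without having to decide what to do with unrelated fields.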