Skip site navigation (1)Skip section navigation (2)
Date:      Fri, 26 Mar 2010 15:17:41 +0200
From:      Andriy Gapon <avg@freebsd.org>
To:        freebsd-fs@freebsd.org, freebsd-geom@freebsd.org
Subject:   Re: g_vfs_open and bread(devvp, ...)
Message-ID:  <4BACB3F5.7010905@freebsd.org>
In-Reply-To: <4BA0A660.3000902@freebsd.org>
References:  <4BA0A660.3000902@freebsd.org>

next in thread | previous in thread | raw e-mail | index | archive | help

Will an offer of a beer help reviewing this change? :-)

on 17/03/2010 11:52 Andriy Gapon said the following:
> I've given a fresh look to the issue of g_vfs_open and bread(devvp, ...) in
> filesystem code.  This time I hope to present my reasoning more clearly than I did
> in my previous attempts.
> For this reason I am omitting historical references and dramatic
> examples/demonstrations but they are still available upon request (and in archives).
> I hope that my shortened notation in references to the code and data structures
> will not be confusing.
> 
> bread() and the API family it belongs to is an interface to buffer cache system.
> Buffer cache system can be roughly divided into two parts: cache part and I/O path
> part.
> I think that it is understood that both parts must have the same notion of a block
> size when translating a block number to a byte offset.  (If this point is not
> clear, I can follow up with another essay).
> 
> In the case of cache code the translation is explicit and simple:
> A) offset = blkno * bo_bsize
> 
> In the case of I/O code the translation is not that straightforward, because it
> can be altered/overridden by bop_strategy which can in turn hook vop_strategy, etc.
> 
> Let's consider a simple case of a filesystem that:
> a) connects to a geom provider via g_vfs_open
> b) doesn't modify anything in devvp->v_bufobj (in particular bo_ops and bo_bsize)
> c) uses bread(devvp) (e.g. to access an equivalent of superblock, etc)
> 
> Short overview of geom_vfs glue:
> 1) g_vfs_open sets devvp->v_bufobj.bo_ops to g_vfs_bufops, where bop_strategy is
> g_vfs_strategy
> 2) bo_bsize is set to pp->sectorsize
> 3) g_vfs_strategy doesn't perform any block-to-offset translation of its own, it
> expects b_iooffset to be correctly set and passes its value to bio_offset
> 
> When a filesystem issues bread(devvp) the following happens in the I/O path:
> I) bread() calls breadn()
> II) in breadn(): bp->b_iooffset = dbtob(bp->b_blkno), that is b_iooffset is set to
> blkno * DEV_BSIZE (where DEV_BSIZE is 512)
> III) breadn() then calls bstrategy() which is a simple wrapper around BO_STRATEGY
> IV) g_vfs_strategy gets called and, as described in (3) above, it simply passes on
> b_iooffset value to bio_offset
> V) thus, a block size used for I/O operation is 512 (DEV_BSIZE)
> VI) on the other hand, as stated in (A) above, block size used in caching code is
> bo_bsize
> 
> Thus, if bo_bsize != DEV_BSIZE, or alternatively said, pp->sectorsize != 512, we
> have a trouble of data getting cached with incorrect offsets.
> 
> Additionally, from (V) above we must conclude that a filesystem must specify block
> numbers in DEV_BSIZE units to bread(devvp, blkno, ...) if the conditions (a), (b),
> (c) are met.  In fact, all such filesystems already do that, because otherwise
> they would read incorrect data from the media.
> 
> So, the problem is only with the caching of the data.
> As such, this issue has little practical effects, because only a small number of
> reads is done via devvp and only for sufficiently small chunks of data (I hope).
> fs/udf used to be greatly affected by this issue when it was reading directory
> nodes via devvp, but that was in the past (prior to 189082).
> 
> Still I think that we should fix this issue for general code correctness/quality
> reasons.  And also to avoid possible future bugs.
> 
> As demonstrated by (V) and (VI) above, obvious and easiest fix is to (always) set
> bo_bsize to DEV_BSIZE in g_vfs_open():
> --- a/sys/geom/geom_vfs.c
> +++ b/sys/geom/geom_vfs.c
> @@ -179,7 +179,7 @@ g_vfs_open(struct vnode *vp, struct g_consumer **cpp, const
> char *fsname, int wr
>  	bo = &vp->v_bufobj;
>  	bo->bo_ops = g_vfs_bufops;
>  	bo->bo_private = cp;
> -	bo->bo_bsize = pp->sectorsize;
> +	bo->bo_bsize = DEV_BSIZE;
>  	gp->softc = bo;
> 
>  	return (error);
> 
> I tested this change with ufs, udf and cd9660 and haven't observed any regressions.
> 
> P.S.
> There might something to changing bread(devvp) convention, so that blkno is
> expected in sectorsize units.  But setting bo_bsize to sectorsize is only a tiny
> portion of what needs to be changed to make it actually work.
> 


-- 
Andriy Gapon



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?4BACB3F5.7010905>