From owner-freebsd-geom@FreeBSD.ORG Fri Mar 26 13:17:44 2010 Return-Path: Delivered-To: freebsd-geom@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 6F585106564A; Fri, 26 Mar 2010 13:17:44 +0000 (UTC) (envelope-from avg@freebsd.org) Received: from citadel.icyb.net.ua (citadel.icyb.net.ua [212.40.38.140]) by mx1.freebsd.org (Postfix) with ESMTP id 72A788FC19; Fri, 26 Mar 2010 13:17:43 +0000 (UTC) Received: from odyssey.starpoint.kiev.ua (alpha-e.starpoint.kiev.ua [212.40.38.101]) by citadel.icyb.net.ua (8.8.8p3/ICyb-2.3exp) with ESMTP id PAA01998; Fri, 26 Mar 2010 15:17:42 +0200 (EET) (envelope-from avg@freebsd.org) Message-ID: <4BACB3F5.7010905@freebsd.org> Date: Fri, 26 Mar 2010 15:17:41 +0200 From: Andriy Gapon User-Agent: Thunderbird 2.0.0.24 (X11/20100319) MIME-Version: 1.0 To: freebsd-fs@freebsd.org, freebsd-geom@freebsd.org References: <4BA0A660.3000902@freebsd.org> In-Reply-To: <4BA0A660.3000902@freebsd.org> X-Enigmail-Version: 0.95.7 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Cc: Subject: Re: g_vfs_open and bread(devvp, ...) X-BeenThere: freebsd-geom@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: GEOM-specific discussions and implementations List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 26 Mar 2010 13:17:44 -0000 Will an offer of a beer help reviewing this change? :-) on 17/03/2010 11:52 Andriy Gapon said the following: > I've given a fresh look to the issue of g_vfs_open and bread(devvp, ...) in > filesystem code. This time I hope to present my reasoning more clearly than I did > in my previous attempts. > For this reason I am omitting historical references and dramatic > examples/demonstrations but they are still available upon request (and in archives). > I hope that my shortened notation in references to the code and data structures > will not be confusing. > > bread() and the API family it belongs to is an interface to buffer cache system. > Buffer cache system can be roughly divided into two parts: cache part and I/O path > part. > I think that it is understood that both parts must have the same notion of a block > size when translating a block number to a byte offset. (If this point is not > clear, I can follow up with another essay). > > In the case of cache code the translation is explicit and simple: > A) offset = blkno * bo_bsize > > In the case of I/O code the translation is not that straightforward, because it > can be altered/overridden by bop_strategy which can in turn hook vop_strategy, etc. > > Let's consider a simple case of a filesystem that: > a) connects to a geom provider via g_vfs_open > b) doesn't modify anything in devvp->v_bufobj (in particular bo_ops and bo_bsize) > c) uses bread(devvp) (e.g. to access an equivalent of superblock, etc) > > Short overview of geom_vfs glue: > 1) g_vfs_open sets devvp->v_bufobj.bo_ops to g_vfs_bufops, where bop_strategy is > g_vfs_strategy > 2) bo_bsize is set to pp->sectorsize > 3) g_vfs_strategy doesn't perform any block-to-offset translation of its own, it > expects b_iooffset to be correctly set and passes its value to bio_offset > > When a filesystem issues bread(devvp) the following happens in the I/O path: > I) bread() calls breadn() > II) in breadn(): bp->b_iooffset = dbtob(bp->b_blkno), that is b_iooffset is set to > blkno * DEV_BSIZE (where DEV_BSIZE is 512) > III) breadn() then calls bstrategy() which is a simple wrapper around BO_STRATEGY > IV) g_vfs_strategy gets called and, as described in (3) above, it simply passes on > b_iooffset value to bio_offset > V) thus, a block size used for I/O operation is 512 (DEV_BSIZE) > VI) on the other hand, as stated in (A) above, block size used in caching code is > bo_bsize > > Thus, if bo_bsize != DEV_BSIZE, or alternatively said, pp->sectorsize != 512, we > have a trouble of data getting cached with incorrect offsets. > > Additionally, from (V) above we must conclude that a filesystem must specify block > numbers in DEV_BSIZE units to bread(devvp, blkno, ...) if the conditions (a), (b), > (c) are met. In fact, all such filesystems already do that, because otherwise > they would read incorrect data from the media. > > So, the problem is only with the caching of the data. > As such, this issue has little practical effects, because only a small number of > reads is done via devvp and only for sufficiently small chunks of data (I hope). > fs/udf used to be greatly affected by this issue when it was reading directory > nodes via devvp, but that was in the past (prior to 189082). > > Still I think that we should fix this issue for general code correctness/quality > reasons. And also to avoid possible future bugs. > > As demonstrated by (V) and (VI) above, obvious and easiest fix is to (always) set > bo_bsize to DEV_BSIZE in g_vfs_open(): > --- a/sys/geom/geom_vfs.c > +++ b/sys/geom/geom_vfs.c > @@ -179,7 +179,7 @@ g_vfs_open(struct vnode *vp, struct g_consumer **cpp, const > char *fsname, int wr > bo = &vp->v_bufobj; > bo->bo_ops = g_vfs_bufops; > bo->bo_private = cp; > - bo->bo_bsize = pp->sectorsize; > + bo->bo_bsize = DEV_BSIZE; > gp->softc = bo; > > return (error); > > I tested this change with ufs, udf and cd9660 and haven't observed any regressions. > > P.S. > There might something to changing bread(devvp) convention, so that blkno is > expected in sectorsize units. But setting bo_bsize to sectorsize is only a tiny > portion of what needs to be changed to make it actually work. > -- Andriy Gapon