Skip site navigation (1)Skip section navigation (2)
Date:      Tue, 12 Feb 2008 15:59:13 +0200
From:      Andriy Gapon <avg@icyb.net.ua>
To:        Bruce Evans <brde@optusnet.com.au>
Cc:        freebsd-fs@freebsd.org, freebsd-hackers@freebsd.org, Poul-Henning Kamp <phk@phk.freebsd.dk>, Pav Lucistnik <pav@freebsd.org>, Bruce Evans <bde@zeta.org.au>
Subject:   Re: fs/udf: vm pages "overlap" while reading large dir
Message-ID:  <47B1A631.1000504@icyb.net.ua>
In-Reply-To: <20080212234018.O92415@delplex.bde.org>
References:  <4299.1202816854@critter.freebsd.dk> <20080212234018.O92415@delplex.bde.org>

next in thread | previous in thread | raw e-mail | index | archive | help
on 12/02/2008 15:11 Bruce Evans said the following:
> On Tue, 12 Feb 2008, Poul-Henning Kamp wrote:
> 
>> In message <47B181F2.2070808@icyb.net.ua>, Andriy Gapon writes:
>>
>>> 2.3. this code passes to bread blkno that is calculated as 4*sector,
>>> where sector is a number of a physical 2048-byte sector. [**]
>>> [**] - I think that this is a requirement of buffcache system, because
>>> internally it performs many calculations that seem to assume that block
>>> size is always 512.
>> Yes, the buf-cache works in 512 bytes units throughout.
> 
> I missed the dbtob() conversions in vfs_bio.c when I replied previously
> So blkno cannot be a cookie; it must be for a 512-block.  So how did
> the cases where bsize != DEV_BSIZE in getblk() ever work?

It seems that this is because specific VOP_STRATEGY and/or BO_STRATEGY
kicked in and did the right thing, see below.

>>> 3.1. for a fresh buf getlbk would assign the following:
>>> bsize = bo->bo_bsize;
>>> offset = blkno * bsize;
>> That's clearly wrong.
> 
> If units were always 512-blocks, then anything except bsize = DEV_BSIZE
> would be clearly wrong.  Things aren't that simple (but probably should
> be).  Even RELENG_3 has bsize = f_iosize (possibly != 512) for non-disks.

O, I missed this obvious thing!
Bruce, thank you for putting me back on the ground :-)
Even in UDF case, when we bread() via a file or directory vnode we pass
a logical 2048-byte block number (within the file) to bread(). In this
case the code in getblk() does the right thing when it calculates offset
as blkno * 2048. Calculating it assuming 512-byte units would be a disaster.
And the actual reading works correctly because udf_strategy is called
for such vnodes and it translates block numbers from physical to logical
and also correctly re-calculates b_iooffset for actual reading. So
b_iooffset value set in breadn() (using bdtob) is completely ignored.

I remember that this is why g_vfs_* was my primary target.
It seems that I revert my opinion back: either g_vfs_open should be
smart about setting bo_bsize, or g_vfs_strategy should be smart about
calculating bio_offset, or individual filesystems should not depend on
g_vfs_* always doing the best thing for them.

In any case, it seems that it is the g_vfs_* that is currently
inconsistent: it sets bo_bsize to a somewhat arbitrary value, but
expects that it would always be provided with correct b_iooffset (the
latter being rigidly calculated via bdtob() in buffcache code). So this
leaves filesystems dead in the water while reading via device vnode:
provide blkno in 512-byte units - screw up VM cache, provide blkno in
bo_bsize units - screw up reading from disk.

Not sure if the FS-s should have wits to set bo_bsize properly and/or
provide proper bo_ops - we are talking about a device vnode here.
Should filesystems be involved in the business of setting its
properties? Not sure.
But it seems that there are many possibilities for "fixups" and various
filesystems [have to] do stuff in their unique ways (*_STRATEGY, etc).

-- 
Andriy Gapon



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?47B1A631.1000504>