Skip site navigation (1)Skip section navigation (2)
Date:      Wed, 06 Feb 2008 18:29:22 +0200
From:      Andriy Gapon <avg@icyb.net.ua>
To:        Scott Long <scottl@samsco.org>, Poul-Henning Kamp <phk@freebsd.org>
Cc:        freebsd-fs@FreeBSD.org, freebsd-hackers@FreeBSD.org, Remko Lodder <remko@FreeBSD.org>, Pav Lucistnik <pav@FreeBSD.org>, Bruce Evans <bde@zeta.org.au>
Subject:   Re: fs/udf: vm pages "overlap" while reading  large dir
Message-ID:  <47A9E062.3040409@icyb.net.ua>
In-Reply-To: <47A73562.4010309@samsco.org>
References:  <200612221824.kBMIOhfM049471@freefall.freebsd.org> <47A2EDB0.8000801@icyb.net.ua> <47A2F404.7010208@icyb.net.ua> <47A7339E.4070708@icyb.net.ua> <47A73562.4010309@samsco.org>

next in thread | previous in thread | raw e-mail | index | archive | help
on 04/02/2008 17:55 Scott Long said the following:
> Andriy Gapon wrote:
[snip]
>> After some code reading and run-time debugging, here are some facts
>> about udf directory reading:
>> 1. bread-ing is done via device vnode (as opposed to directory vnodes),
>> as a consequence udf_strategy is not involved in directory reading
>> 2. the device vnode has a non-NULL v_object, this means that VMIO is used
[strike out incorrect guess]
>> 4. bread is done on as much data as possible, up to MAXBSIZE [and even
>> slightly more in some cases] (i.e. code determines directory data size
>> and attempts to read in everything in one go)
>> 5. physical sector number adjusted to DEV_BSIZE (512 bytes) sectors is
>> passed to bread - this is because of #1 above and peculiarities of buf
>> system
>> 6. directory reading and file reading are quite different, because the
>> latter does reading via file vnode
>>
>> Let's a consider a simple case of "directory reading" 5 * PAGE_SIZE (4K)
>> bytes starting from a physical sector N with N%2 = 0 from media with
>> sector size of 2K (most common CD/DVD case):
>> 1. we want to read 12 KB = 3 pages = 6 sectors starting from sector N,
>> N%2 = 0
>> 2. 4*N is passed as a sector number to bread
>> 3. bo_bsize of the corresponding bufobj is a media sector size, i.e. 2K
>> 4. actual read in bread will happen from b_iooffset of 4*N*DEV_BSIZE =
>> N*2K, i.e. correct offset within the device
>> 5. for a fresh read getblk will be called
>> 6. getblk will set b_offset to blkno*bo_bsize=4*N*2K, i.e. 4 times the
>> correct byte offset within the device
>> 7. allocbuf will allocate pages using B_VMIO case
>> 8. allocbuf calculates base page index as follows: pi = b_offset/PAGE_SIZE
>> this means the following:
>> sectors N and N+1 will be bound to a page with index 4*N*2K/4K = 2*N
>> sectors N+2 and N+3 will be bound to the next page 2*N +1
>> sectors N+4 and N+5 is tied to the next page 2*N + 2
[insert] note that bstrategy will get correct params as b_iooffset is
set to blkno*DEV_BSIZE, i.e. correct byte offset.
>>
>> Now let's consider a "directory read" of 2 sectors (1 page) starting at
>> physical sector N+1.
>> Using the same calculations as above, we see that the sector will be
>> tied to a page 2*(N+1) = 2*N + 2.
>> And this page already contains valid cached data from the previous read,
>> *but* it contains data from sectors N+4 and N+5 instead of N+1 and N+2.
>>
>> So, we see, that because of b_offset being 4 times the actual byte
>> offset, we get incorrect page indexing. Namely, sector X gets associated
>> with different pages depending on what sector is used as a starting
>> sector for bread. If bread starts at sector N, then data of sector N+1
>> is associated with (second half of) page with index 2*N; but if bread
>> starts at sector N+1 then it gets associated with (the first half of)
>> page with index 2*(N+1).
[snip]
> I think you forgot to attach the patch =-)  This is an excellent 
> write-up, and I think you're on the right track.  It's been nearly 8 
> years since I wrote most of this code, so my memory is a bit fuzzy, but
> I think the reason that I used the device vnode is because I was having
> trouble with overlapping buffers when using the directory vnode, so this
> was the easiest way to avoid that at the time.  Since it sounds like 
> this is no longer a viable solution, I definitely support your ideas.

[additional analysis of the problem, with new facts that might give rise
to a different solution]

Small summary of the above long description.
For directory reading fs/udf performs bread() on a (underlying) device
vnode. It passes block number as if block size was 512 bytes (i.e.
byte_offset_within_dev/512). On the other hand vm page index calculation
code uses the following formula: (blkno*bo_bsize)/PAGE_SIZE.
For CD/DVD devices bo_bsize is set to 2048. This results in page index
overlap when reading sufficiently many blocks from adjacent starting blocks.
That is, it seems that in "normal" cases page index gets calculated as
byte_offset/PAGE_SIZE, but in this case page index gets to be
4*byte_offset/PAGE_SIZE. More details are in the quoted above text.

With a lot of help from Bruce Evance I found this commit which seems to
have made a big difference:
http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/kern/vfs_bio.c.diff?r1=1.458;r2=1.459;f=h

Before this change page index for device vnodes was always
(blkno*512)/PAGE_SIZE. This is because of the vn_isdisk() case.

udf_mountfs device vnode is passed to g_vfs_open() (in geom_vfs.c) and
the latter has the following line of code:
bo->bo_bsize = pp->sectorsize;
Where pp is geom provider for the device in question.

I am not sure if the mentioned above vfs_bio.c change broke something
else in reading from device vnodes or if it fixed something for that. I
am also not sure what would be the consequences of setting bo_bsize to
512 for vnodes of "disk" devices. It would most probably fix current
code in UDF, but I am not sure if it might break anything else.

Just wanted to let the more knowledgeable people know and make a decision.

P.S. bo_bsize seems to be referenced only in a handful of files and in
most of them it seems that it is related to "file" vnodes (as opposed to
"device" vnodes).

-- 
Andriy Gapon



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?47A9E062.3040409>