Date:      Mon, 04 Feb 2008 08:55:14 -0700
From:      Scott Long <scottl@samsco.org>
To:        Andriy Gapon <avg@icyb.net.ua>
Cc:        freebsd-fs@FreeBSD.org, freebsd-hackers@FreeBSD.org, Remko Lodder <remko@FreeBSD.org>, Pav Lucistnik <pav@FreeBSD.org>, Bruce Evans <bde@zeta.org.au>
Subject:   Re: fs/udf: vm pages "overlap" while reading  large dir
Message-ID:  <47A73562.4010309@samsco.org>
In-Reply-To: <47A7339E.4070708@icyb.net.ua>
References:  <200612221824.kBMIOhfM049471@freefall.freebsd.org> <47A2EDB0.8000801@icyb.net.ua> <47A2F404.7010208@icyb.net.ua> <47A7339E.4070708@icyb.net.ua>

Andriy Gapon wrote:
> More on the problem with reading big directories on UDF.
> 
> First, some sleuthing. I have come to believe that the problem is caused
> by some larger change in the vfs/vm/buf area. It seems that now VMIO is applied
> to more vnode types than before. In particular it seems that now vnodes
> for devices have non-NULL v_object (or maybe this is about directory
> vnodes, I am not sure).
> 
> Also it seems that the problem should happen for any directory larger
> than four 2048-byte sectors (I think that any directory with > 300 files
> would definitely qualify).
> 
> After some code reading and run-time debugging, here are some facts
> about udf directory reading:
> 1. bread-ing is done via the device vnode (as opposed to the directory
> vnode); as a consequence udf_strategy is not involved in directory reading
> 2. the device vnode has a non-NULL v_object, which means that VMIO is used
> 3. it seems that the code assumes that non-VM buffers are used
> 4. bread is done on as much data as possible, up to MAXBSIZE [and even
> slightly more in some cases] (i.e. the code determines the directory data
> size and attempts to read everything in one go)
> 5. the physical sector number is converted to DEV_BSIZE (512-byte) units
> before being passed to bread - this is because of #1 above and
> peculiarities of the buf system (see the sketch after this list)
> 6. directory reading and file reading are quite different, because the
> latter reads via the file vnode
> 
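> Here is a minimal userland sketch of the arithmetic behind points 4 and 5
> (the constants mirror DEV_BSIZE and the 2K media sector size; N is a
> hypothetical sector number, and this only models the conversion, it is
> not the kernel code):
> 
>   #include <stdio.h>
> 
>   #define DEV_BSIZE   512    /* unit of block numbers given to bread on the device vnode */
>   #define SECTOR_SIZE 2048   /* media sector size for CD/DVD */
> 
>   int main(void)
>   {
>           long long N = 1000;  /* hypothetical physical sector number */
> 
>           /* block number handed to bread when reading via the device vnode */
>           long long blkno = N * (SECTOR_SIZE / DEV_BSIZE);   /* = 4*N */
> 
>           /* byte offset the read really happens from (b_iooffset) */
>           long long iooffset = blkno * DEV_BSIZE;            /* = N*2K, correct */
> 
>           printf("sector %lld -> blkno %lld -> byte offset %lld\n",
>               N, blkno, iooffset);
>           return 0;
>   }
> 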
> Let's consider a simple case of "directory reading" 3 * PAGE_SIZE (4K)
> bytes starting from a physical sector N with N%2 = 0, from media with a
> sector size of 2K (the most common CD/DVD case):
> 1. we want to read 12 KB = 3 pages = 6 sectors starting from sector N,
> N%2 = 0
> 2. 4*N is passed as a sector number to bread
> 3. bo_bsize of the corresponding bufobj is the media sector size, i.e. 2K
> 4. actual read in bread will happen from b_iooffset of 4*N*DEV_BSIZE =
> N*2K, i.e. correct offset within the device
> 5. for a fresh read getblk will be called
> 6. getblk will set b_offset to blkno*bo_bsize=4*N*2K, i.e. 4 times the
> correct byte offset within the device
> 7. allocbuf will allocate pages using B_VMIO case
> 8. allocbuf calculates the base page index as follows: pi = b_offset/PAGE_SIZE
> this means the following (see the sketch below):
> sectors N and N+1 will be bound to the page with index 4*N*2K/4K = 2*N
> sectors N+2 and N+3 will be bound to the next page, 2*N + 1
> sectors N+4 and N+5 will be bound to the next page, 2*N + 2
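> 
> The sketch below just redoes that arithmetic for this read (the starting
> sector N is hypothetical, and the computation only mimics what
> getblk/allocbuf do, it is not their code):
> 
>   #include <stdio.h>
> 
>   #define DEV_BSIZE   512
>   #define SECTOR_SIZE 2048   /* bo_bsize of the device bufobj */
>   #define PAGE_SIZE   4096
> 
>   int main(void)
>   {
>           long long N = 1000;  /* hypothetical starting sector, N % 2 == 0 */
> 
>           long long blkno = N * (SECTOR_SIZE / DEV_BSIZE);  /* 4*N, passed to bread */
>           long long b_offset = blkno * SECTOR_SIZE;         /* getblk: 4*N*2K, 4x too big */
>           long long pi = b_offset / PAGE_SIZE;              /* allocbuf: base page index 2*N */
> 
>           for (int s = 0; s < 6; s++)                       /* 6 sectors = 3 pages */
>                   printf("sector N+%d -> page index 2*N + %lld\n",
>                       s, (pi + (s * SECTOR_SIZE) / PAGE_SIZE) - 2 * N);
>           return 0;
>   }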
> 
> Now let's consider a "directory read" of 2 sectors (1 page) starting at
> physical sector N+1.
> Using the same calculations as above, we see that these sectors will be
> tied to the page with index 2*(N+1) = 2*N + 2.
> And this page already contains valid cached data from the previous read,
> *but* it contains data from sectors N+4 and N+5 instead of N+1 and N+2.
> 
> So we see that, because b_offset is 4 times the actual byte offset, we
> get incorrect page indexing. Namely, sector X gets associated
> with different pages depending on what sector is used as a starting
> sector for bread. If bread starts at sector N, then data of sector N+1
> is associated with (second half of) page with index 2*N; but if bread
> starts at sector N+1 then it gets associated with (the first half of)
> page with index 2*(N+1).
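> 
> A tiny program along these lines (again with a hypothetical N, and only
> modelling the index arithmetic) shows the mismatch directly:
> 
>   #include <stdio.h>
> 
>   #define DEV_BSIZE   512
>   #define SECTOR_SIZE 2048
>   #define PAGE_SIZE   4096
> 
>   /* page index that sector `sec` lands in when bread starts at sector `start` */
>   static long long
>   page_of(long long start, long long sec)
>   {
>           long long blkno = start * (SECTOR_SIZE / DEV_BSIZE);  /* 4*start */
>           long long b_offset = blkno * SECTOR_SIZE;             /* 4*start*2K */
> 
>           return (b_offset + (sec - start) * SECTOR_SIZE) / PAGE_SIZE;
>   }
> 
>   int main(void)
>   {
>           long long N = 1000;  /* hypothetical even starting sector */
> 
>           /* prints 2*N = 2000 and then 2*(N+1) = 2002: two pages for one sector */
>           printf("sector N+1, read starting at N:   page %lld\n", page_of(N, N + 1));
>           printf("sector N+1, read starting at N+1: page %lld\n", page_of(N + 1, N + 1));
>           return 0;
>   }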
> 
> Previously, before the VMIO change, data for all reads was put into
> separate buffers that did not share anything between them. So the problem
> was limited to wasting memory on duplicate data, and no actual over-runs
> happened. Now we have the over-runs because the VM pages are shared
> between the buffers of the same vnode.
> 
> One obvious solution is to limit bread size to 2*PAGE_SIZE = 4 *
> sector_size. In this case, as before, we would waste some memory on
> duplicate data but we would avoid page overruns. If we limit bread size
> even more, to 1 sector, then we would not have any duplicate data at
> all. But there would still be some resource waste - each page would
> correspond to one sector, so a 4K page would hold only 2K of valid data
> and the other half of each page would be unused.
> 
> Another solution, which to me seems better, is to do "usual" reading -
> pass the directory vnode and a 2048-byte sector offset to bread.
> In this case udf_strategy will get called for blkno translation, all
> offsets and indexes will be correct and everything will work perfectly,
> as is the case for all other filesystems.
> The only overhead in this case comes from the need to handle the
> UDF_INVALID_BMAP case (where data is embedded into the file entry). It
> means that we have to call bmap_internal to see if we have that special
> case and handle it, and if the case is not special we would call bread
> on the directory vnode, which means that bmap_internal would be called
> again in udf_strategy.
> One optimization that can be done in this case is to create a lighter
> version of bmap_internal that would merely check for the special case
> and wouldn't do anything else.
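> 
> To make the shape of that control flow concrete, here is a small userland
> model (every name in it is made up; only bread, udf_bmap_internal and
> UDF_INVALID_BMAP exist in the real code, and the model does not pretend
> to be the actual patch):
> 
>   #include <stdio.h>
> 
>   #define EMBEDDED 1   /* stands in for the UDF_INVALID_BMAP case */
>   #define MAPPED   0
> 
>   /* stands in for a "lighter" bmap_internal that only detects the special case */
>   static int
>   light_bmap_check(int dir_has_embedded_data)
>   {
>           return dir_has_embedded_data ? EMBEDDED : MAPPED;
>   }
> 
>   static void
>   read_directory(int dir_has_embedded_data)
>   {
>           if (light_bmap_check(dir_has_embedded_data) == EMBEDDED) {
>                   printf("copy data straight out of the file entry\n");
>                   return;
>           }
>           /* otherwise bread on the *directory* vnode with a 2K-sector offset;
>              udf_strategy would then call udf_bmap_internal for blkno translation */
>           printf("bread(directory vnode, logical sector, ...)\n");
>   }
> 
>   int main(void)
>   {
>           read_directory(0);   /* normal directory: goes through udf_strategy */
>           read_directory(1);   /* embedded data: handled without bread */
>           return 0;
>   }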
> 
> I am attaching a patch based on the ideas above. It fixes the problem
> for me and doesn't seem to create any new ones :-)
> About the patch:
> hunk #1 fixes a copy/paste error
> hunk #2 fixes strategy so that it does not set a junk b_blkno in case of
> udf_bmap_internal failure, and also replaces divisions/multiplications
> with bit-shifts
> hunk #3 really limits the max bread size to MAXBSIZE and also makes the
> bread be done on the directory vnode
> 
> BTW, maybe it's a good idea to further limit the size of bread() anyway,
> e.g. to always read 1 sector. Anyway, this should make a difference only
> in a minority of cases, and I am not sure about the performance effects
> (I do not expect a difference in anything other than performance).
> 

I think you forgot to attach the patch =-)  This is an excellent 
write-up, and I think you're on the right track.  It's been nearly 8 
years since I wrote most of this code, so my memory is a bit fuzzy, but
I think the reason that I used the device vnode is because I was having
trouble with overlapping buffers when using the directory vnode, so this
was the easiest way to avoid that at the time.  Since it sounds like 
this is no longer a viable solution, I definitely support your ideas.

Scott



