From owner-freebsd-hackers@FreeBSD.ORG Tue Feb 12 12:33:42 2008 Return-Path: Delivered-To: freebsd-hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 6CA3516A419; Tue, 12 Feb 2008 12:33:42 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail06.syd.optusnet.com.au (mail06.syd.optusnet.com.au [211.29.132.187]) by mx1.freebsd.org (Postfix) with ESMTP id EBEB513C4E8; Tue, 12 Feb 2008 12:33:41 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from c211-30-219-213.carlnfd3.nsw.optusnet.com.au (c211-30-219-213.carlnfd3.nsw.optusnet.com.au [211.30.219.213]) by mail06.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id m1CCXa7n029892 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Tue, 12 Feb 2008 23:33:38 +1100 Date: Tue, 12 Feb 2008 23:33:36 +1100 (EST) From: Bruce Evans X-X-Sender: bde@delplex.bde.org To: Poul-Henning Kamp In-Reply-To: <3027.1202806417@critter.freebsd.dk> Message-ID: <20080212220951.I92195@delplex.bde.org> References: <3027.1202806417@critter.freebsd.dk> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Mailman-Approved-At: Tue, 12 Feb 2008 13:45:45 +0000 Cc: freebsd-fs@freebsd.org, freebsd-hackers@freebsd.org, Pav Lucistnik , Andriy Gapon , Bruce Evans Subject: Re: fs/udf: vm pages "overlap" while reading large dir X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 12 Feb 2008 12:33:42 -0000 On Tue, 12 Feb 2008, Poul-Henning Kamp wrote: > In message <47B0CE54.6090204@icyb.net.ua>, Andriy Gapon writes: >> on 06/02/2008 18:29 Andriy Gapon said the following: >>> Small summary of the above long description. >>> For directory reading fs/udf performs bread() on a (underlying) device >>> vnode. It passes block number as if block size was 512 bytes (i.e. >>> byte_offset_within_dev/512). > > We have three sizes of relevance here, the sectorsize of the provider, > the blocksize of the filesystem and the page size of the system. 4. The block size required for bread() and friends (almost always DEV_BSIZE). > In general it is adventurous to have any of them be anything but > powers of two, and it is at best ill-adviced and more likely asking > for trouble to do requests that are not multiple of and aligned to > the sectorsize of the provider. > > So up front, I would say that it is an UDF bug to do 512 byte reads off > a 2k sector provider. > > Making the buffer cache handle this is possible, but it is not the > direction we have planned for the buffercache (ie: that it should > become a wrapper for using the VM system, rather than being a > separately allocated cache). > > So if the objective is to make UDF work in the short term, your > change might work, if the objective is to move FreeBSD's kernel > architecture forward, the right thing to do is to fix UDF to not > access subsectors. This bug has nothing to do with subsectors, and very little to do with udf. udf is just depending on bread()'s API being unbroken. That API requires starting with blocks consisting of whole sectors (else the subsequent I/O would fail) and converting to blocks of size DEV_BSIZE, phyexcept for unusual (non-disk?) file systems like nfs (and no others?). All (?) disk file systems do this conversion. The bug is just more noticeable for udf since it is used more often on disks with a block size != DEV_BSIZE, and it apparently does something that makes the bug mess up VM more than other file systems. Of course, it might be better for the API to take blocks in units of the sector size, but that is not what it takes, and this can't be changed easily or safely. The standard macro btodb() is often used to convert from bytes to blocks of this size, and doesn't support bo_bsize or the bufobj pointer that would be needed to reach bo_bsize. ffs mostly uses its fsbtodb() macro, which converts blocks in ffs block (frag?)-sized units to blocks in DEV_BSIZE units so as to pass them to unbroken bread() and friends. ffs initializes bo_bsize to its block (not frag) size, and then never uses it directly except in ffs_rawread(), where it is used to check that the I/O is in units of whole sectors (otherwise, ffs_rawread() has DEV_BSIZE more hard-coded than the rest of ffs). The details of fsbtodb() are interesting. They show signs of previous attempts to use units of sectors. From ffs/fs.h: % /* % * Turn filesystem block numbers into disk block addresses. % * This maps filesystem blocks to device size blocks. % */ % #define fsbtodb(fs, b) ((daddr_t)(b) << (fs)->fs_fsbtodb) % #define dbtofsb(fs, b) ((b) >> (fs)->fs_fsbtodb) Creation of fs_fsbtodb is left to newfs. The units of DEV_BSIZE are thus built in to the on-disk data struct (in a fairly easy to change and/or override way, but there are a lot of existing disks). From newfs.c: % realsectorsize = sectorsize; % if (sectorsize != DEV_BSIZE) { /* XXX */ % int secperblk = sectorsize / DEV_BSIZE; % % sectorsize = DEV_BSIZE; % fssize *= secperblk; % if (pp != NULL) % pp->p_size *= secperblk; % } % mkfs(pp, special); Though mkfs() appears to support disk blocks of size = sectorsize, its sectorsize parameter is hacked on here, so it always generates fs_fsbtodb and other derived parameters for disk blocks of size DEV_BSIZE, as is required for the unbroken bread() API. msdosfs is similar to ffs here, except it constructs the shift count at mount time, to arrange for always converting to DEV_BSIZE blocks for passing the bread() and friends. udf is effectively similar, but pessimal and disorganized. For the conversion for bread(), it uses btodb(). In udf_bmap(), it constructs a shift count on every call by subtracting DEV_BSHIFT from its block shift count. In udf_strategy(), on every call it constructs a multiplier instead of a shift count, by dividing its block size by DEV_BSIZE. It's weird that the strategy routing, which will soon end up sending sectors to the disk, is converting to DEV_BSIZE units, a size that cannot be the size for udf's normal media. cd9660 uses btodb() for one conversion for bread() and ( - DEV_BSHIFT) in 7 other conversions to DEV_BSIZE units. So it's surprising that any file systems work on CDs and DVDs. Maybe the bug affects udf more because udf crosses page boundaries more. It's also surprising that the bug has such small effects. This seems to be because the blkno arg to bread() and friends is little more than a cookie. Each fs's strategy routine converts it to a byte offset later, and could do this using any unique cookie, but actually uses a simple shift forth and back. All fs's are consistent in their conversion to and from DEV_BSIZE or other units. Not much more than getblk() needs more than a cookie or uses an inconsistent conversion to a byte offset. Bruce