From owner-freebsd-hackers Sun Dec 14 23:24:31 1997
Return-Path:
Received: (from root@localhost) by hub.freebsd.org (8.8.7/8.8.7) id XAA29664 for hackers-outgoing; Sun, 14 Dec 1997 23:24:31 -0800 (PST) (envelope-from owner-freebsd-hackers)
Received: from smtp02.primenet.com (smtp02.primenet.com [206.165.6.132]) by hub.freebsd.org (8.8.7/8.8.7) with ESMTP id XAA29659 for ; Sun, 14 Dec 1997 23:24:27 -0800 (PST) (envelope-from tlambert@usr09.primenet.com)
Received: (from daemon@localhost) by smtp02.primenet.com (8.8.8/8.8.8) id AAA00141; Mon, 15 Dec 1997 00:23:57 -0700 (MST)
Received: from usr09.primenet.com(206.165.6.209) via SMTP by smtp02.primenet.com, id smtpd000113; Mon Dec 15 00:23:39 1997
Received: (from tlambert@localhost) by usr09.primenet.com (8.8.5/8.8.5) id AAA27555; Mon, 15 Dec 1997 00:23:36 -0700 (MST)
From: Terry Lambert
Message-Id: <199712150723.AAA27555@usr09.primenet.com>
Subject: Re: blocksize on devfs entries (and related)
To: mike@smith.net.au (Mike Smith)
Date: Mon, 15 Dec 1997 07:23:36 +0000 (GMT)
Cc: tlambert@primenet.com, mike@smith.net.au, bgingery@gtcs.com, hackers@FreeBSD.ORG
In-Reply-To: <199712150642.RAA01358@word.smith.net.au> from "Mike Smith" at Dec 15, 97 05:12:41 pm
X-Mailer: ELM [version 2.4 PL23]
Content-Type: text
Sender: owner-freebsd-hackers@FreeBSD.ORG
X-Loop: FreeBSD.org
Precedence: bulk

> > I dealt with this when I changed the DIRBLKSZ from 512 to 1024 in the

[ ... ]

> Ok.  By this you confess that you understand the issues, and you have
> dealt with them comprehensively before.  I do hope that Artisoft let
> you take your notes with you...  8)

They can't remove them without a scalpel.

I had done a similar change to UFS at Novell, previously.  It's
common technology, I think.

> > It would have to set the DIRBLKSZ to the minimum of the amount required
> > and the physical block size, and deal with aggregating multiple blocks.
>
> Pardon my ignorance, but what's the issue with aggregating blocks?  I
> presume that the directory code reads/writes in multiples of DIRBLKSZ

Soft updates.

If you look at the UFS/FFS code, you'll see that directory blocks are
512b, and all other metadata structures are a block size or some
integer factor of a block size (inodes are 128b -- 4 per disk block)
in size.

The metadata synchronicity guarantees are largely predicated on the
idea that no metadata modification will be larger than a block.  This
is with good reason, now that the cylinder boundaries are unknown.

Consider a system running soft updates, in the face of a power
failure, with a two block write crossing a cylinder boundary.  This
will result in a seek.  All but the most modern drives (which can
power track-buffer writes and a seek from the rotational energy of
the disk) will fail to commit the write properly.  This would be bad.

It is a big issue in considering a switch to soft updates, and it's a
meta-issue for high availability systems, where an operation against
a database is not atomic, but must be idempotent.  So long as a
single "transaction" can't cross a seek boundary, this is not an
issue.  When a single transaction can, then it becomes a *big* issue.

I'm pretty sure McKusick's soft updates implementation assumes
physical block atomicity for metadata updates.

> How is an update of a group of blocks any less atomic than the update
> of a single block?

See above.

> Is this only an issue in the light of fragmentation
> of a directory on a non-DIRBLKSZ multiple boundary?

No.  One would also have to consider LFS extents "fragmented" across
a seek boundary, in the event that the soft updates implementation is
ever extended to apply to anything other than FFS/UFS (its current
limitation -- it's not a general event-graph solution to the ordering
guarantee problem).

This is also a general issue for ACLs, OS/2 Extended Attributes, HFS
resource forks, and any other construct that (potentially) pushes
metadata over a seek boundary.  NTFS and VIVA are other examples that
come to mind.
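To make the single-block assumption concrete, here is a rough sketch
of the arithmetic (illustrative only -- the names and constants are
mine, as used in this message, not lifted from the FFS sources):

#include <stdio.h>

#define DEV_BSIZE   512     /* physical block (sector) size          */
#define DIRBLKSZ    512     /* FFS directory block size              */

/*
 * Illustrative only: a metadata update of 'len' bytes at byte offset
 * 'off' can be committed by a single atomic sector write only if it
 * does not cross a sector boundary.
 */
static int
update_is_atomic(unsigned long off, unsigned long len)
{
        return (off / DEV_BSIZE) == ((off + len - 1) / DEV_BSIZE);
}

int
main(void)
{
        /* A 512b directory block on a sector boundary: atomic (1).  */
        printf("512b dir block:  %d\n", update_is_atomic(0, DIRBLKSZ));

        /* A 1024b directory block: two sectors, possibly with a seek
         * between them -- the case discussed above: not atomic (0). */
        printf("1024b dir block: %d\n", update_is_atomic(0, 2 * DIRBLKSZ));
        return (0);
}

The ordering guarantees can treat the first case as all-or-nothing;
the second depends on two sector writes (and possibly an intervening
seek) both completing.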
> > > > Consider a FAT FS.  A FAT FS deals with 1K blocks.  But these 1K blocks
> > > > are not constrained to start at an even offset from the start of the
> > > > disk, only from an even cylinder boundary.
> > >
> > > In the light of the nonexistence of "cylinders" in the proposed model,
> > > it strikes me that this becomes an issue of synthesising a conforming
> > > pseudo-geometry at filesystem creation time, and little more.
> > > Compatibility is likely an issue there.
> >
> > Page alignment must equal 1k block alignment for the 0th of 4 blocks;
> > alternately, access will be... uh, slow.
>
> I think you've left out enough context there that I can't approach any
> of that.  Unless you are referring to mmap() on such files.

No.  The mmap() issue is separate.

Consider two 4k pages on disk:

[          4k          ][          4k          ]

This is made up of physical blocks:

[ 512b ][ 512b ][ 512b ][ 512b ][ 512b ][ 512b ][ 512b ][ 512b ]

These physical blocks are aggregated, beginning on a "cylinder
boundary", to form "FAT blocks":

[ 512b ][  1k  ][  1k  ][  1k  ][ 512b ]
                                       ^
                                       |
                           page boundary

Now we have a FAT "block" that spans a page boundary.  To access
(read/write) the block, we must access two pages -- 8k.

This is slow.

There is an unsigned char associated with the in-core page reference
structure.  The bits of this value are intended to indicate validity
of non-fault-based (ie: not mmap()) presence, so that you can do
block level instead of page level I/O (for devices with a blocksize
of more than 512b, for a 4k page).  This would require an access of
1k (two 512b blocks) to reference the page.

Assuming this is implemented (the bits are there, but unused for this
purpose), this still leaves the need to do a page mapping.

To combat this, the "device" for a FATFS must begin on the mythical
"cylinder boundary", so that one, not two, page mappings are required
for random access of the FAT.  This will result in the 1k "blocks"
being aligned on page boundaries for the device.

Not doing these things is "uh, slow" (a sketch of the page-crossing
arithmetic is appended below).

This is one of the reasons the Linux FATFS implementation kicks the
FreeBSD implementation's butt on performance metrics.


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.
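Here is the promised sketch of the page-crossing arithmetic
(illustrative only -- the constants and function names are mine, not
taken from the FreeBSD or Linux FAT FS sources):

#include <stdio.h>

#define PAGE_SIZE   4096UL  /* VM page size                          */
#define SECT_SIZE   512UL   /* physical sector size                  */
#define FAT_BSIZE   1024UL  /* FAT "block" size from the text above  */

/*
 * Illustrative only: how many VM pages an access of 'len' bytes at
 * device byte offset 'off' has to map.
 */
static unsigned long
pages_touched(unsigned long off, unsigned long len)
{
        return ((off + len - 1) / PAGE_SIZE) - (off / PAGE_SIZE) + 1;
}

int
main(void)
{
        unsigned long n;

        /* FAT blocks laid out from a page-aligned start: every 1k
         * block lives inside a single page.                         */
        for (n = 0; n < 4; n++)
                printf("aligned block %lu:   %lu page(s)\n",
                    n, pages_touched(n * FAT_BSIZE, FAT_BSIZE));

        /* FAT blocks laid out from a 512b offset (the "cylinder
         * boundary" case in the diagram): the fourth block straddles
         * the page boundary and needs two page mappings.            */
        for (n = 0; n < 4; n++)
                printf("unaligned block %lu: %lu page(s)\n",
                    n, pages_touched(SECT_SIZE + n * FAT_BSIZE, FAT_BSIZE));
        return (0);
}

With the FAT area started on a page-aligned boundary, every 1k block
maps with one page; started 512b off, one block in four straddles a
page boundary and costs a second mapping.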