From owner-freebsd-hackers Sun Dec 14 23:24:31 1997
Return-Path:
Received: (from root@localhost) by hub.freebsd.org (8.8.7/8.8.7) id XAA29664 for hackers-outgoing; Sun, 14 Dec 1997 23:24:31 -0800 (PST) (envelope-from owner-freebsd-hackers)
Received: from smtp02.primenet.com (smtp02.primenet.com [206.165.6.132]) by hub.freebsd.org (8.8.7/8.8.7) with ESMTP id XAA29659 for ; Sun, 14 Dec 1997 23:24:27 -0800 (PST) (envelope-from tlambert@usr09.primenet.com)
Received: (from daemon@localhost) by smtp02.primenet.com (8.8.8/8.8.8) id AAA00141; Mon, 15 Dec 1997 00:23:57 -0700 (MST)
Received: from usr09.primenet.com(206.165.6.209) via SMTP by smtp02.primenet.com, id smtpd000113; Mon Dec 15 00:23:39 1997
Received: (from tlambert@localhost) by usr09.primenet.com (8.8.5/8.8.5) id AAA27555; Mon, 15 Dec 1997 00:23:36 -0700 (MST)
From: Terry Lambert
Message-Id: <199712150723.AAA27555@usr09.primenet.com>
Subject: Re: blocksize on devfs entries (and related)
To: mike@smith.net.au (Mike Smith)
Date: Mon, 15 Dec 1997 07:23:36 +0000 (GMT)
Cc: tlambert@primenet.com, mike@smith.net.au, bgingery@gtcs.com, hackers@FreeBSD.ORG
In-Reply-To: <199712150642.RAA01358@word.smith.net.au> from "Mike Smith" at Dec 15, 97 05:12:41 pm
X-Mailer: ELM [version 2.4 PL23]
Content-Type: text
Sender: owner-freebsd-hackers@FreeBSD.ORG
X-Loop: FreeBSD.org
Precedence: bulk

> > I dealt with this when I changed the DIRBLKSZ from 512 to 1024 in the

[ ... ]

> Ok.  By this you confess that you understand the issues, and you have
> dealt with them comprehensively before.  I do hope that Artisoft let
> you take your notes with you...  8)

They can't remove them without a scalpel.

I had done a similar change to UFS at Novell, previously.  It's
common technology, I think.

> > It would have to set the DIRBLKSZ to the minimum of the amount required
> > and the physical block size, and deal with aggregating multiple blocks.
>
> Pardon my ignorance, but what's the issue with aggregating blocks?  I
> presume that the directory code reads/writes in multiples of DIRBLKSZ

Soft updates.

If you look at the UFS/FFS code, you'll see that directory blocks are
512b, and all other metadata structures are a block size or some
integer factor of a block size (inodes are 128b -- 4 per disk block)
in size.

The metadata synchronicity guarantees are largely predicated on the
idea that no metadata modification will be larger than a block.  This
is with good reason, now that the cylinder boundaries are unknown.

Consider a system running soft updates, in the face of a power
failure, with a two block write crossing a cylinder boundary.  This
will result in a seek.  All but the most modern drives (which can
power track-buffer writes and a seek from the rotational energy of
the disk) will fail to commit the write properly.  This would be bad.

It is a big issue in considering a switch to soft updates, and it's a
meta-issue for high availability systems, where an operation against
a database is not atomic, but must be idempotent.  So long as a
single "transaction" can't cross a seek boundary, this is not an
issue.  When a single transaction can, then it becomes a *big* issue.

I'm pretty sure McKusick's soft updates implementation assumes
physical block atomicity for metadata updates.

> How is an update of a group of blocks any less atomic than the update
> of a single block?

See above.

> Is this only an issue in the light of fragmentation
> of a directory on a non-DIRBLKSZ multiple boundary?

No.  One would also have to consider LFS extents "fragmented" across
a seek boundary, in the event that the soft updates implementation is
ever extended to apply to anything other than FFS/UFS (its current
limitation -- it's not a general event-graph solution to the ordering
guarantee problem).

This is also a general issue for ACLs, OS/2 Extended Attributes, HFS
resource forks, and any other construct that (potentially) pushes
metadata over a seek boundary.  NTFS and VIVA are other examples that
come to mind.
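To make the single-block assumption concrete, here is a rough sketch
of the arithmetic (illustrative only -- the names and constants are
mine, as used in this message, not lifted from the FFS sources):

#include <stdio.h>

#define DEV_BSIZE   512     /* physical block (sector) size          */
#define DIRBLKSZ    512     /* FFS directory block size              */

/*
 * Illustrative only: a metadata update of 'len' bytes at byte offset
 * 'off' can be committed by a single atomic sector write only if it
 * does not cross a sector boundary.
 */
static int
update_is_atomic(unsigned long off, unsigned long len)
{
        return (off / DEV_BSIZE) == ((off + len - 1) / DEV_BSIZE);
}

int
main(void)
{
        /* A 512b directory block on a sector boundary: atomic (1).  */
        printf("512b dir block:  %d\n", update_is_atomic(0, DIRBLKSZ));

        /* A 1024b directory block: two sectors, possibly with a seek
         * between them -- the case discussed above: not atomic (0). */
        printf("1024b dir block: %d\n", update_is_atomic(0, 2 * DIRBLKSZ));
        return (0);
}

The ordering guarantees can treat the first case as all-or-nothing;
the second depends on two sector writes (and possibly an intervening
seek) both completing.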
> > > > Consider a FAT FS.  A FAT FS deals with 1K blocks.  But these 1K blocks
> > > > are not constrained to start at an even offset from the start of the
> > > > disk, only from an even cylinder boundary.
> > >
> > > In the light of the nonexistence of "cylinders" in the proposed model,
> > > it strikes me that this becomes an issue of synthesising a conforming
> > > pseudo-geometry at filesystem creation time, and little more.
> > > Compatibility is likely an issue there.
> >
> > Page alignment must equal 1k block alignment for the 0th of 4 blocks;
> > alternately, access will be... uh, slow.
>
> I think you've left out enough context there that I can't approach any
> of that.  Unless you are referring to mmap() on such files.

No.  The mmap() issue is separate.

Consider two 4k pages on disk:

[          4k          ][          4k          ]

This is made up of physical blocks:

[ 512b ][ 512b ][ 512b ][ 512b ][ 512b ][ 512b ][ 512b ][ 512b ]

These physical blocks are aggregated, beginning on a "cylinder
boundary", to form "FAT blocks":

[ 512b ][  1k  ][  1k  ][  1k  ][ 512b ]
                                       ^
                                       |
                           page boundary

Now we have a FAT "block" that spans a page boundary.  To access
(read/write) the block, we must access two pages -- 8k.

This is slow.

There is an unsigned char associated with the in-core page reference
structure.  The bits of this value are intended to indicate validity
of non-fault-based (ie: not mmap()) presence, so that you can do
block level instead of page level I/O (for devices with a blocksize
of more than 512b, for a 4k page).  This would require an access of
1k (two 512b blocks) to reference the page.

Assuming this is implemented (the bits are there, but unused for this
purpose), this still leaves the need to do a page mapping.

To combat this, the "device" for a FATFS must begin on the mythical
"cylinder boundary", so that one, not two, page mappings are required
for random access of the FAT.  This will result in the 1k "blocks"
being aligned on page boundaries for the device.

Not doing these things is "uh, slow" (a sketch of the page-crossing
arithmetic is appended below).

This is one of the reasons the Linux FATFS implementation kicks the
FreeBSD implementation's butt on performance metrics.


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.
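Here is the promised sketch of the page-crossing arithmetic
(illustrative only -- the constants and function names are mine, not
taken from the FreeBSD or Linux FAT FS sources):

#include <stdio.h>

#define PAGE_SIZE   4096UL  /* VM page size                          */
#define SECT_SIZE   512UL   /* physical sector size                  */
#define FAT_BSIZE   1024UL  /* FAT "block" size from the text above  */

/*
 * Illustrative only: how many VM pages an access of 'len' bytes at
 * device byte offset 'off' has to map.
 */
static unsigned long
pages_touched(unsigned long off, unsigned long len)
{
        return ((off + len - 1) / PAGE_SIZE) - (off / PAGE_SIZE) + 1;
}

int
main(void)
{
        unsigned long n;

        /* FAT blocks laid out from a page-aligned start: every 1k
         * block lives inside a single page.                         */
        for (n = 0; n < 4; n++)
                printf("aligned block %lu:   %lu page(s)\n",
                    n, pages_touched(n * FAT_BSIZE, FAT_BSIZE));

        /* FAT blocks laid out from a 512b offset (the "cylinder
         * boundary" case in the diagram): the fourth block straddles
         * the page boundary and needs two page mappings.            */
        for (n = 0; n < 4; n++)
                printf("unaligned block %lu: %lu page(s)\n",
                    n, pages_touched(SECT_SIZE + n * FAT_BSIZE, FAT_BSIZE));
        return (0);
}

With the FAT area started on a page-aligned boundary, every 1k block
maps with one page; started 512b off, one block in four straddles a
page boundary and costs a second mapping.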