From owner-freebsd-hackers  Sun Dec 14 23:47:54 1997
Return-Path:
Received: (from root@localhost)
        by hub.freebsd.org (8.8.7/8.8.7) id XAA01416
        for hackers-outgoing; Sun, 14 Dec 1997 23:47:54 -0800 (PST)
        (envelope-from owner-freebsd-hackers)
Received: from word.smith.net.au (vh1.gsoft.com.au [203.38.152.122])
        by hub.freebsd.org (8.8.7/8.8.7) with ESMTP id XAA01382
        for ; Sun, 14 Dec 1997 23:47:39 -0800 (PST)
        (envelope-from mike@word.smith.net.au)
Received: from word (localhost [127.0.0.1])
        by word.smith.net.au (8.8.8/8.8.5) with ESMTP id SAA01554;
        Mon, 15 Dec 1997 18:12:07 +1030 (CST)
Message-Id: <199712150742.SAA01554@word.smith.net.au>
X-Mailer: exmh version 2.0zeta 7/24/97
To: Terry Lambert
cc: mike@smith.net.au (Mike Smith), bgingery@gtcs.com, hackers@FreeBSD.ORG
Subject: Re: blocksize on devfs entries (and related)
In-reply-to: Your message of "Mon, 15 Dec 1997 07:23:36 -0000."
        <199712150723.AAA27555@usr09.primenet.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Date: Mon, 15 Dec 1997 18:12:06 +1030
From: Mike Smith
Sender: owner-freebsd-hackers@FreeBSD.ORG
X-Loop: FreeBSD.org
Precedence: bulk

> > > It would have to set the DIRBLKSZ to the minimum of the amount required
> > > and the physical block size, and deal with aggregating multiple blocks.
> >
> > Pardon my ignorance, but what's the issue with aggregating blocks?  I
> > presume that the directory code reads/writes in multiples of DIRBLKSZ
>
> Soft updates.
>
> If you look at the UFS/FFS code, you'll see that directory blocks are 512b,
> and all other metadata structures are a block size or some integer factor
> of a blocksize (inodes are 128b -- 4 per disk block) in size.
>
> The metadata synchronicity guarantees are largely predicated on the idea
> that no metadata modification will be larger than a block.  This is with
> good reason, now that the cylinder boundaries are unknown.
>
> Consider a system running soft updates, in the face of a power failure,
> with a two-block write crossing a cylinder boundary.  This will result
> in a seek.  All but the most modern drives (which can power track-buffer
> writes and a seek through the rotational energy of the disk) will fail
> to commit the write properly.
>
> This would be bad.

Unfortunately, the above reasoning is soggy.  In the face of a power
failure there is no guarantee that a given block will be completely
updated, so the current "guaranteed atomicity" for single-block writes
doesn't exist.

> It is a big issue in considering a switch to soft updates, and it's
> a meta-issue for high-availability systems, where an operation against
> a database is not atomic, but must be idempotent.

Let me just make sure I've got this: the basic point of concern is that
a multiblock transaction with storage may (under exceptional conditions)
only partially complete, and this partial completion may lead to states
from which recovery would be problematic.  Correct?

Hmm.

> So long as a single "transaction" can't cross a seek boundary, this is
> not an issue.  When a single transaction can, then it becomes a *big*
> issue.

I think that the concept of a "seek" boundary is irrelevant; any
potential dividing point satisfies the above criteria for concern.  A
seek window is larger than the interblock window (which is still a
possible split point), but smaller than a bad-block forwarding window.

> This is also a general issue for ACLs, OS/2 Extended Attributes, HFS
> resource forks, and any other construct that pushes metadata
> (potentially) over a seek boundary.
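As an aside, to put the size arithmetic quoted earlier (512b directory
blocks, 128b inodes, "no metadata update larger than a block") in
concrete terms, here is a minimal standalone sketch.  The constants and
the helper name are illustrative stand-ins, not code lifted from the
UFS sources; the check is simply whether a metadata update stays inside
one physical sector, which is the case the traditional "atomicity"
argument relies on.

    #include <stdio.h>

    #define SECTOR_SIZE     512     /* physical block ("sector") size */
    #define DIR_BLOCK_SIZE  512     /* directory block size, as quoted */
    #define DINODE_SIZE     128     /* on-disk inode size: 4 per sector */

    /*
     * Does a metadata update of 'len' bytes (len > 0) at byte offset
     * 'off' stay within a single physical sector?  If not, a power
     * failure between the component writes can leave it half-committed.
     */
    static int
    update_fits_in_one_sector(unsigned long off, unsigned long len)
    {
            return (off / SECTOR_SIZE) == ((off + len - 1) / SECTOR_SIZE);
    }

    int
    main(void)
    {
            printf("inodes per sector: %d\n", SECTOR_SIZE / DINODE_SIZE);

            /* one directory block, sector-aligned: the traditional case */
            printf("aligned directory block in one sector: %d\n",
                update_fits_in_one_sector(0, DIR_BLOCK_SIZE));

            /* a 1K update starting at sector 3: crosses a sector boundary */
            printf("two-sector update in one sector: %d\n",
                update_fits_in_one_sector(3 * SECTOR_SIZE, 1024));
            return (0);
    }

With those (assumed) numbers the aligned directory-block update stays
inside one sector and the 1K update does not; the quoted argument is
that only the first kind was ever supposed to happen to metadata.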
NTFS and VIVA are other examples of this that come to mind.

So what's the "standard" solution?  Put serialisation markers in every
physical block?  How would this deal with a synthetic volume using
least-common-multiple aggregate blocks on base media with different
block sizes?

> > > > > Consider a FAT FS.  A FAT FS deals with 1K blocks.  But these 1K blocks
> > > > > are not constrained to start at an even offset from the start of the
> > > > > disk, only from an even cylinder boundary.
> > > >
> > > > In the light of the nonexistence of "cylinders" in the proposed model,
> > > > it strikes me that this becomes an issue of synthesising a conforming
> > > > pseudo-geometry at filesystem creation time, and little more.
> > > > Compatibility is likely an issue there.
> > >
> > > Page alignment must equal 1k block alignment for the 0th of 4 blocks;
> > > alternately, access will be... uh, slow.
> >
> > I think you've left out enough context there that I can't approach any
> > of that.  Unless you are referring to mmap() on such files.
>
> No.  The mmap() issue is separate.
>
> Consider two 4k pages on disk:
>
> [ 4k ][ 4k ]
>
> This is made up of physical blocks:
>
> [ 512b ][ 512b ][ 512b ][ 512b ][ 512b ][ 512b ][ 512b ][ 512b ]
>
> These physical blocks are aggregated, beginning on a "cylinder boundary",
> to form "FAT blocks":
>
> [ 512b ][ 1k ][ 1k ][ 1k ][ 512b ]
>                  ^
>                  |
>           page boundary
>
> Now we have a FAT "block" that spans a page boundary.
>
> To access (read/write) the block, we must access two pages -- 8k.

This is where I'm not getting it.  To read the cluster (the "FAT
block"), one reads the required physical blocks, surely.  Or is this
issue of "page alignment" more to do with the cache?

mike
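P.S.  For what it's worth, here is a throwaway sketch of the alignment
arithmetic in the diagram above.  The constants are illustrative
assumptions (4k pages, 1k clusters laid down starting one 512b sector
into the region shown), not anything taken from real FAT code:

    #include <stdio.h>

    #define PAGE_SIZE       4096    /* page-cache page size */
    #define CLUSTER_SIZE    1024    /* FAT "block" (cluster) size */
    #define DATA_OFFSET     512     /* clusters start one sector in */

    int
    main(void)
    {
            unsigned cl;

            for (cl = 0; cl < 8; cl++) {
                    unsigned long start = DATA_OFFSET +
                        (unsigned long)cl * CLUSTER_SIZE;
                    unsigned long end = start + CLUSTER_SIZE - 1;
                    unsigned long firstpg = start / PAGE_SIZE;
                    unsigned long lastpg = end / PAGE_SIZE;

                    printf("cluster %u: bytes %lu-%lu, pages %lu-%lu%s\n",
                        cl, start, end, firstpg, lastpg,
                        firstpg == lastpg ? "" : "  <- straddles a page");
            }
            return (0);
    }

With those numbers every fourth cluster straddles a page boundary, so
reading or writing it through a page-oriented cache touches two pages
(8k of cache traffic) rather than one -- which I assume is the "slow"
case being described.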