Date:      Sat, 13 Dec 1997 20:55:27 +0000 (GMT)
From:      Terry Lambert <tlambert@primenet.com>
To:        mike@smith.net.au (Mike Smith)
Cc:        bgingery@gtcs.com, hackers@FreeBSD.ORG
Subject:   Re: blocksize on devfs entries (and related)
Message-ID:  <199712132055.NAA29304@usr06.primenet.com>
In-Reply-To: <199712130848.TAA01888@word.smith.net.au> from "Mike Smith" at Dec 13, 97 07:18:36 pm

> > Theoretically, the physical layout of the device should be stored
> > whether or not there's any filesystem on it.
> 
> This is a fundamentally flawed approach, and I am glad that Julian's 
> new SLICE model (at this stage) completely ignores any incidental 
> parametric information associated with an extent.

One "incidental" piece of parametric information I am interested in
seeing is the physical block size.

Consider the FFS directory management code.  It has knowledge of
physical blocks.  In fact, it cannot easily handle a directory
block that is not exactly one physical block in size.  The current
code cannot cope with a directory block split across block I/Os,
nor can it handle partial block I/Os well (there are a number of
failure modes).

This becomes *very* important if you ever want to support Unicode,
which takes 2 bytes per character, or EUC-encoded "Big 5", which
may take up to 5 bytes per character (one of the reasons I am "for"
Unicode and "against" EUC/ISO2022).

To fit a maximally-sized file name in a single directory block,
the directory block would have to be more than 512 bytes long.
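
To make the arithmetic concrete, here is a rough sketch; the struct
follows my reading of the 4.4BSD ufs/ufs/dir.h layout, and the
reclen() helper exists only for this illustration:

#include <stdio.h>
#include <stdint.h>

/*
 * Sketch of the 4.4BSD FFS on-disk directory entry.  Entries are
 * padded to 4-byte boundaries and are never allowed to cross a
 * DIRBLKSIZ (512-byte) boundary, so a directory block can be
 * written atomically in a single sector I/O.
 */
#define DIRBLKSIZ   512
#define MAXNAMLEN   255

struct direct {
    uint32_t d_ino;                 /* inode number of entry */
    uint16_t d_reclen;              /* length of this record */
    uint8_t  d_type;                /* file type */
    uint8_t  d_namlen;              /* length of name string */
    char     d_name[MAXNAMLEN + 1]; /* name, NUL padded */
};

/* Record length for a name of 'bytes' bytes, padded to 4-byte alignment. */
static size_t
reclen(size_t bytes)
{
    return (8 + bytes + 1 + 3) & ~(size_t)3;
}

int
main(void)
{
    /* 255 one-byte characters: fits comfortably in one directory block. */
    printf("8-bit name:  %zu bytes\n", reclen(MAXNAMLEN));
    /* 255 characters at 2 bytes each (UCS-2 style Unicode): does not fit. */
    printf("2-byte name: %zu bytes (DIRBLKSIZ is %d)\n",
        reclen(MAXNAMLEN * 2), DIRBLKSIZ);
    return 0;
}

With one byte per character the maximal record is 264 bytes and fits;
at two bytes per character it needs 520 bytes, which no longer fits in
a 512-byte directory block.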


> Physical blocksize vs. logical blocksize is a problematic issue.  On 
> one hand, there is the desire to maintain simplicity by mandating a 
> single blocksize across all boundaries and forcing translation at the 
> lowest practical level.  The downside with this is dealing with legal 
> logical block operations that result in partial block operations at the 
> lowest level.
> 
> One approach with plenty of historical precedent is to use a blocksize 
> "sufficiently large" that it is a multiple of the likely device 
> blocksizes, and make that the 'uniform standard'.  Another is to 
> cascade blocksizes upwards, where the blocksize at a given point in the 
> tree is the lowest common multiple of that of all points below.  This 
> obviously requires some extra smarts in each layer that consumes 
> multiple lower layers.

Both of these are addressable by using a logical block size and hiding
the block boundaries, as necessary, from the upper level code.

Consider a putative "audio read/write CD FS", where the block size is
not an integer factor of the page size.  How does one MMAP files from
such a beast?
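
A rough sketch of the arithmetic such a layer has to hide; the
2352-byte raw audio sector and 4096-byte page are just the numbers
chosen for the example:

#include <stdio.h>
#include <stdint.h>

/*
 * Illustrative only: a 2352-byte raw audio sector is not an integer
 * factor of a 4096-byte VM page, so filling one page for mmap()
 * always touches partial device blocks at one or both ends.
 */
#define PAGE_SIZE   4096
#define CD_BLKSIZE  2352    /* raw audio sector */

int
main(void)
{
    uint64_t page_off = 3 * PAGE_SIZE;          /* byte offset of the page */
    uint64_t first = page_off / CD_BLKSIZE;     /* first device block needed */
    uint64_t last = (page_off + PAGE_SIZE - 1) / CD_BLKSIZE;

    printf("page at byte %llu needs blocks %llu..%llu\n",
        (unsigned long long)page_off,
        (unsigned long long)first, (unsigned long long)last);
    printf("leading partial block: skip %llu bytes\n",
        (unsigned long long)(page_off - first * CD_BLKSIZE));
    printf("trailing partial block: use %llu bytes\n",
        (unsigned long long)(page_off + PAGE_SIZE - last * CD_BLKSIZE));
    return 0;
}

Every page fill ends up consuming partial device blocks at one or
both ends, and that is exactly what the logical-block layer has to
paper over for the mmap machinery.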


> Incorrect.  It is relatively straightforward to create a vnode disk, 
> slice it, build a FAT filesystem in one slice and then pass that slice 
> to your favorite PC emulator.

I believe there should be an INT 21 redirector in the PC emulators to
allow them to use any VFS disk to which the process they are running
in has legitimate access.


> >   Yet, why deny these the optimization information which will allow
> >   them to map (within the constraints of their architecture) a new
> >   filesystem for best throughput, if it's actually available.
> 
> Because any "optimisation information" that you could pass them would 
> be wrong.  Any optimisation attempting to operate based on incorrect 
> parameters can only be a pessimisation. 

This is not strictly true.  I have been arguing for a long time for
the use of the 8 "partial page bits" within the kernel.

Consider a FAT FS.  A FAT FS deals with 1K blocks.  But these 1K blocks
are not constrained to start at an even offset from the start of the
disk, only from an even cylinder boundary.

This means that 512b of the 1k block can be at the end of one page, and
the other 512b of the block can be at the beginning of another.

It makes a hell of a lot of sense to be able to read only 1k of data
from the disk instead of 16k, especially in light of your arguments
about not saving optimization information, such as cylinder boundaries.
A 16k read is 16 times more likely to result in a "hidden" seek in
your model.
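
Here is a sketch of what I mean by the partial page bits; the FAT
numbers (a 1K block starting 512 bytes before a page boundary) are
assumptions for the example, and valid_bits() is not a real kernel
interface:

#include <stdio.h>
#include <stdint.h>

/*
 * With 4096-byte pages and 512-byte device sectors there are 8
 * sub-page chunks per page, so a one-byte validity mask per page is
 * enough to record which 512-byte pieces actually hold data read
 * from the disk.
 */
#define PAGE_SIZE   4096
#define DEV_BSIZE   512
#define CHUNKS      (PAGE_SIZE / DEV_BSIZE)     /* 8 */

/* Compute the valid bits for [off, off+len) within the page at 'page_base'. */
static uint8_t
valid_bits(uint64_t page_base, uint64_t off, uint64_t len)
{
    uint8_t mask = 0;
    for (int i = 0; i < CHUNKS; i++) {
        uint64_t c_start = page_base + (uint64_t)i * DEV_BSIZE;
        if (c_start < off + len && c_start + DEV_BSIZE > off)
            mask |= (uint8_t)(1 << i);
    }
    return mask;
}

int
main(void)
{
    /* A 1K FAT block: last 512 bytes of one page, first 512 of the next. */
    uint64_t blk_off = 7 * PAGE_SIZE - DEV_BSIZE;
    uint64_t blk_len = 1024;

    printf("first page mask:  0x%02x\n", valid_bits(6 * PAGE_SIZE, blk_off, blk_len));
    printf("second page mask: 0x%02x\n", valid_bits(7 * PAGE_SIZE, blk_off, blk_len));
    return 0;
}

Only two of the sixteen 512-byte chunks across the two pages need to
be marked valid, so a single 1K I/O satisfies the request instead of
filling both pages completely.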

> >    With what we're all doing today, it seems that taking a certain
> >    number of cylinders for slices is best - but other access methods
> >    may find an underlying physical structure more convenient if
> >    a slice specifies a range of heads and cylinders that do NOT
> >    presume that all heads/cylinders from starting to ending according
> >    to physical layout are part of the same slice.  It may be quite
> >    convenient to have a cluster of heads across physical devices
> >    forming a logical device or slice, without fully dedicating those
> >    physical devices to that use.
> 
> This is a nonsense question in the context of ZBR and "logical extent" 
> devices (eg. SCSI, ATAPI, most ATA devices).

There are a number of SCSI devices that support independent head seeks
per platter (not nearly as many as the non-SCSI big iron IBM/DEC drives,
but they do exist).

Similarly, where a single seek mechanism controls multiple heads, any
block which crosses a platter * sector_size boundary could result in
another seek.

Admittedly, the clustering wins are much larger than the geometry wins,
but the geometry wins are non-zero.  In addition, even ZBR SCSI II
disks can return the actual breakdown of sector counts within each zone
(in the extended mode page).  So the data *is* available on newer disks;
it's the older SCSI I disks that lack this information.
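
As an illustration of the boundary-crossing check a geometry-aware
allocator could make (the geometry numbers are made up; nothing here
reads a real mode page):

#include <stdio.h>
#include <stdint.h>

/*
 * Hypothetical geometry: does a transfer starting at 'lba' for
 * 'count' sectors stay within one cylinder, or does it spill over
 * and potentially cost another seek?
 */
#define SECTORS_PER_TRACK   63
#define HEADS               16
#define SECTORS_PER_CYL     (SECTORS_PER_TRACK * HEADS)

static int
crosses_cylinder(uint64_t lba, uint32_t count)
{
    return (lba / SECTORS_PER_CYL) != ((lba + count - 1) / SECTORS_PER_CYL);
}

int
main(void)
{
    printf("lba 100, 32 sectors:  %s\n",
        crosses_cylinder(100, 32) ? "crosses" : "contained");
    printf("lba 1000, 32 sectors: %s\n",
        crosses_cylinder(1000, 32) ? "crosses" : "contained");
    return 0;
}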

> >        And, I'll mention again, DISK formats are not the only
> >    random-access mass-storage formats on the horizon!  I'm guessing
> >    that for speed of inclusion into product lines, all will emulate
> >    a disk drive - but that may not be the most efficient way of using
> >    them (in fact, probably not).  They also can be expected to have
> >    "direct access" methods according to their physical architecture,
> >    with some form of tree-access the MOST efficient!
> 
> In most cases, the internal architecture of the device will be 
> optimised for two basic operations; retrieval of large contiguous 
> extents, and read/write of small randomly scattered regions.

Consider the use of static column DRAM for the implementation of such a
device.  Column access after single element access is many times
faster.  IBM, Sanyo-ICON, and several others used this technique (a
former supervisor of mine holds the patent for a zero cycle latency
L2 cache used in the Sanyo-ICON machines using these devices).


[ ... device arrival/departure events ... ]

> This is trivially obvious, and forms the basic argument for the use of 
> DEVFS.  You fail to draw the parallel between system startup and the 
> conceptual "massive arrival of devices" which is still the major 
> argument for such a system.

The events should, ideally, be propagated out of the devfs framework,
IMO.  I'm not sure Julian agrees with me on this.  This would mean that
after you ask the slice handlers if they want the device, and they all
say "no", it's a terminal device.  But the event should keep going.

I think that file system mounts should result from arrival events.  This
is actually the basis of my argument that the mounting of a device should
be separate from the mapping of the device into the FS hierarchy.  If all
devices could be mounted, even though you don't know their mapping into
the hierarchy, then an event could result from a successful mount, and
be propagated to a hierarchy mapping agent.

At that point, fstab is merely a means of dictating mapping.  To a large
extent it is unnecessary to have an fstab at all: one could use the
"last mounted on" field for this data, and ignore the "/mnt" FS's
(leaving them unmapped), or even modify newfs to stuff the mountpoint
into the "last mounted on" field.
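
In outline, the flow I'm arguing for looks something like this; none
of these function names correspond to a real interface, it's just the
chain of events:

#include <stdio.h>
#include <string.h>

/*
 * Sketch of the event flow described above:
 * arrival -> slice handlers -> terminal device -> mount -> mapping.
 */
struct device {
    const char *name;
    const char *last_mounted_on;    /* as recorded by newfs/mount */
    int         claimed_by_slicer;
};

/* Offer the new device to the slice handlers; 0 means nobody claimed it. */
static int
offer_to_slice_handlers(struct device *dev)
{
    return dev->claimed_by_slicer;
}

/* Mount a terminal device without deciding where it appears in the tree. */
static int
mount_unmapped(struct device *dev)
{
    printf("mounted %s (not yet in the namespace)\n", dev->name);
    return 0;
}

/* The mapping agent: use the "last mounted on" field, skip /mnt scratch FSs. */
static void
map_into_hierarchy(struct device *dev)
{
    if (strcmp(dev->last_mounted_on, "/mnt") == 0) {
        printf("%s left unmapped\n", dev->name);
        return;
    }
    printf("%s mapped at %s\n", dev->name, dev->last_mounted_on);
}

/* Arrival event: keeps propagating past the slice handlers. */
static void
device_arrived(struct device *dev)
{
    if (offer_to_slice_handlers(dev))
        return;                     /* the slicer generates child arrivals */
    if (mount_unmapped(dev) == 0)
        map_into_hierarchy(dev);    /* mount event reaches the mapping agent */
}

int
main(void)
{
    struct device da0s1e = { "da0s1e", "/usr", 0 };
    struct device da1s1a = { "da1s1a", "/mnt", 0 };

    device_arrived(&da0s1e);
    device_arrived(&da1s1a);
    return 0;
}

The point of the split is that the mount succeeds or fails on its own,
and the namespace decision is a separate, later event.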

Ah, brave new world, that has such things in't.  8-).


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.


