Date:      Sat, 13 Dec 1997 19:18:36 +1030
From:      Mike Smith <mike@smith.net.au>
To:        bgingery@gtcs.com
Cc:        hackers@FreeBSD.ORG
Subject:   Re: blocksize on devfs entries (and related) 
Message-ID:  <199712130848.TAA01888@word.smith.net.au>
In-Reply-To: Your message of "Tue, 09 Dec 1997 15:09:42 PDT." <199712092209.PAA07923@home.gtcs.com> 


I haven't noticed any commentary on this, Brian, so I thought I should 
raise a few points that you appear to have missed.

> Theoretically, the physical layout of the device should be stored
> whether or not there's any filesystem on it.

This is a fundamentally flawed approach, and I am glad that Julian's 
new SLICE model (at this stage) completely ignores any incidental 
parametric information associated with an extent.

> To me some answers to these ...
> 
>      1.  physical block/sector size needs to be stored by DEVICE
>         this may or may not match the logical blocksize of any
>         filesystem resident on the device.  Optimal transfer blocksize
>         for each of read and write ALSO need to be stored.

Physical blocksize vs. logical blocksize is a problematic issue.  On 
one hand, there is the desire to maintain simplicity by mandating a 
single blocksize across all boundaries and forcing translation at the 
lowest practical level.  The downside with this is dealing with legal 
logical block operations that result in partial block operations at the 
lowest level.
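
To make the partial-block problem concrete, here is a minimal sketch (not taken from any FreeBSD code) that splits a byte-range write into physical-block operations. The function name `split_write` and the labels "full" and "rmw" (read-modify-write) are illustrative assumptions only.

```python
def split_write(offset, length, phys_bs):
    """Split a byte-range write into per-physical-block operations.

    Returns a list of (block_number, kind) pairs. kind is "full" when the
    write covers the whole physical block, and "rmw" when only part of the
    block is covered, so the lowest layer would have to read the block,
    patch it, and write it back.
    """
    if length <= 0:
        return []
    end = offset + length                  # first byte past the write
    first = offset // phys_bs              # first physical block touched
    last = (end - 1) // phys_bs            # last physical block touched
    ops = []
    for blk in range(first, last + 1):
        blk_start = blk * phys_bs
        blk_end = blk_start + phys_bs
        covered = offset <= blk_start and end >= blk_end
        ops.append((blk, "full" if covered else "rmw"))
    return ops


# A 512-byte logical block written onto a device with 2048-byte
# physical blocks touches only part of physical block 0.
print(split_write(1536, 512, 2048))
```

A legal single-block write at the logical level therefore turns into a read plus a write at the bottom of the stack whenever the logical blocksize is smaller than the physical one.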

One approach with plenty of historical precedent is to use a blocksize 
"sufficiently large" that it is a multiple of the likely device 
blocksizes, and make that the 'uniform standard'.  Another is to 
cascade blocksizes upwards, where the blocksize at a given point in the 
tree is the least common multiple of that of all points below.  This 
obviously requires some extra smarts in each layer that consumes 
multiple lower layers.
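
The cascading approach can be sketched as follows, assuming a deliberately simple model in which a leaf device is just an integer blocksize and a layer that consumes several lower layers is a list of them. The name `cascaded_blocksize` and this representation are hypothetical, chosen only for illustration.

```python
from functools import reduce
from math import gcd


def lcm(a, b):
    """Least common multiple of two positive integers."""
    return a // gcd(a, b) * b


def cascaded_blocksize(node):
    """Blocksize visible at this point in the tree.

    A leaf is an int (the device blocksize); an inner node is a
    non-empty list of child nodes, and its blocksize is the least
    common multiple of the blocksizes of everything below it.
    """
    if isinstance(node, int):
        return node
    if not node:
        raise ValueError("a layer must consume at least one lower layer")
    return reduce(lcm, (cascaded_blocksize(child) for child in node))


# A layer over a 512-byte disk and a sub-layer joining 2048- and
# 1024-byte devices ends up with a 2048-byte blocksize.
print(cascaded_blocksize([512, [2048, 1024]]))
```

Note how quickly the result grows when the lower blocksizes share few factors: 512 and 520 cascade to 33280.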

>      2.  physical layout (sect/track, tracks/cyl) also needs to
>         be stored for any DASD.  Also any OTHER known info which
>         may be used to optimize the filesystem building process for
>         the device, such as rotational speed, seek timing ..  If
>         this is not stored with driver info in the devfs, then
>         some pointer or common reference point should be made to
>         the "file entry" that contains the info.

Physical layout is a joke, and has been for many years.  This 
suggestion costs you a lot of credibility.

Qualitative parametric information may be useful, e.g. "this disk is 
slow", presuming that a set of usefully general metrics can be 
established.  Unfortunately, obtaining measurements such as this can be 
slow, and the results are often nondeterministic.

>      3.  If at the controller level it is possible to concatenate
>         or RAID join devices, that information needs to be stored
>         for the device.  If this is intrinsic to the device driver
>         or the physical device - no matter.

This is not useful.  An upper layer should not care whether the extent 
it is consuming is a concatenation of extents.  This is an issue for 
management tools, which should have an OOB technique for recovering 
structure information.
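
As a rough illustration of that separation, the sketch below models a concatenated extent whose consumer sees only a flat block range, while a separate `layout()` call stands in for the out-of-band query a management tool would use. The class and method names are assumptions for this example and do not correspond to any SLICE or DEVFS interface.

```python
from bisect import bisect_right


class ConcatExtent:
    """A flat block range built by concatenating child extents."""

    def __init__(self, sizes):
        # starts[i] is the first block of child i in the flat range.
        self.sizes = list(sizes)
        self.starts = []
        total = 0
        for size in self.sizes:
            self.starts.append(total)
            total += size
        self.size = total

    def locate(self, block):
        """Map a flat block number to (child_index, block_within_child).

        Used internally for I/O; the upper layer never needs to see it.
        """
        if not 0 <= block < self.size:
            raise IndexError("block out of range")
        idx = bisect_right(self.starts, block) - 1
        return idx, block - self.starts[idx]

    def layout(self):
        """Out-of-band structure query: (start, size) per child."""
        return list(zip(self.starts, self.sizes))


extent = ConcatExtent([100, 50, 200])
print(extent.size)          # all an upper layer has to know
print(extent.locate(120))   # resolved below the upper layer
print(extent.layout())      # what a management tool would ask for
```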

>      6.  When a device is opened ro, if the underlying hardware has
>         ANY indication that it's a ro open, then if it is later upgraded
>         there should at least be a hook for it to be notified that it
>         has been upgraded.  Current state (ro/rw) should be available
>         to user processes without "testing it by opening a write file"
>         to a filesystem (or even raw device). 

The RO->RW upgrade notification is a contentious issue, but one that 
definitely needs thinking through.  How would you suggest it be 
handled?  Should the standard be to reopen the device, or pass a 
special ioctl, or add a new device entrypoint?

>   Other thoughts.  Especially WRT possible experimental work, and
>   emulators, it will be QUITE convenient to have everything that can
>   be used to optimize the construction of a filesystem (of any of many
>   many kinds) or slice-out and construct a filesystem.  As wine, dosemu
>   and bochs (to just name three) expand the emulations supporting other
>   OSs, being free with filesystems for those OSs, other than purely
>   "native" becomes all the more important.

I can't actually parse this; I'm not sure if you're actually trying to 
say anything at all.

>   SoftPC/SoftWindows and Bochs both create internally what amounts to a
>   FAT filesystem within a file - a vnode filesystem, but not using
>   system provisions for it.  That pretty well eliminates "device" access
>   to the filesystem and (e.g.) doing a mount_msdos on 'em for other
>   processing and data exchange, without adapting the emulator's code
>   to *parallel* what we've already got with FreeBSD.

Incorrect.  It is relatively straightforward to create a vnode disk, 
slice it, build a FAT filesystem in one slice and then pass that slice 
to your favorite PC emulator.

>   Yet, why deny these the optimization information which will allow
>   them to map (within the constraints of their architecture) a new
>   filesystem for best throughput, if it's actually available.

Because any "optimisation information" that you could pass them would 
be wrong.  Any optimisation attempting to operate based on incorrect 
parameters can only be a pessimisation. 

>   Now let me raise some additional questions --
> 
> 
>        Should a DASD be mappable ONLY with horizontal slices?
>    With what we're all doing today, it seems that taking a certain
>    number of cylinders for slices is best - but other access methods
>    may find an underlying physical structure more convenient if
>    a slice specifies a range of heads and cylinders that do NOT
>    presume that all heads/cylinders from starting to ending according
>    to physical layout are part of the same slice.  It may be quite
>    convenient to have a cluster of heads across physical devices
>    forming a logical device or slice, without fully dedicating those
>    physical devices to that use.

This is a nonsense question in the context of ZBR (zone bit recording) 
and "logical extent" devices (e.g. SCSI, ATAPI, most ATA devices).

>        And, I'll mention again, DISK formats are not the only
>    random-access mass-storage formats on the horizon!  I'm guessing
>    that for speed of inclusion into product lines, all will emulate
>    a disk drive - but that may not be the most efficient way of using
>    them (in fact, probably not).  They also can be expected to have
>    "direct access" methods according to their physical architecture,
>    with some form of tree-access the MOST efficient!

In most cases, the internal architecture of the device will be 
optimised for two basic operations; retrieval of large contiguous 
extents, and read/write of small randomly scattered regions.

Data access patterns are unlikely to change radically, particularly 
given the momentum that modern systems have.  I'll let you work out 
what the two above are, and why they are so common.  But trust me, they 
are.

>        Finally - one of the most powerful potentials of the devfs is
>    handling non-DASD devices!  The connecting or turning-on of a device
>    (nic/fax/printer/external-modem/scanner/parallel-to-parallel
>    connection to another PC, even industrial controls of some kind) SHOULD
>    cause it to "arrive".  If its turn-on generates a signal that can be
>    caught by a minimal driver, that may trigger a load of a full driver
>    (arrival event) and its inclusion in the devfs listings.  Similarly,
>    killing such a device might trigger an immediate or delayed unloading
>    of the same driver, and removal from the devfs.

This is trivially obvious, and forms the basic argument for the use of 
DEVFS.  You fail to draw the parallel between system startup and the 
conceptual "massive arrival of devices" which is still the major 
argument for such a system.

mike




