From owner-freebsd-arch  Thu Oct 14 18:22:23 1999
From: Terry Lambert <tlambert@primenet.com>
Message-Id: <199910150121.SAA20401@usr01.primenet.com>
Subject: Re: The eventual fate of BLOCK devices.
To: julian@whistle.com (Julian Elischer)
Date: Fri, 15 Oct 1999 01:21:04 +0000 (GMT)
Cc: freebsd-arch@freebsd.org, mckusick@mckusick.com
In-Reply-To: <Pine.BSF.4.10.9910141621480.17468-100000@current1.whistle.com> from "Julian Elischer" at Oct 14, 99 04:56:25 pm
X-Mailer: ELM [version 2.4 PL25]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

Here's an argument I haven't heard before, and then a response to
Julian:


Question 1:	How will I netboot my non-FreeBSD OS that
		requires block devices using an NFS mounted
		/ containing that OS's /dev, if FreeBSD can
		not support block devices in its FS?

Question 2:	If block device nodes are still allowed in
		the FS on a FreeBSD box, what's the point of
		not allowing variant behaviour based on an
		implied ioctl in the open routine for 'b'
		vs. 'c' nodes?

Question 3:	If a single if test based on the 'b'-ness of
		a device is allowed at open time (no real
		performance penalty for operations against
		already open 'c' devices), why should I not
		be able to call variant code (e.g. ioctl)?

Question 4:	If the argument is simply against the variant
		code being in the kernel by default, then why
		not permit it in a kernel module, which need
		not be loaded by default (or could be loaded
		by default, except on the systems of people
		who hate block devices)?


It seems logical to me to allow legacy systems to netboot, and
therefore to allow block device nodes to be created.  Code that
varies on 'b'-ness can't hurt people who don't like it if they
never use it: check whether a function pointer is null, fail as
if block devices are not supported when it is, and have a kernel
module set the pointer non-null when it is loaded.
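
A minimal user-space sketch of that null-pointer hook, for illustration
only.  Every name here (bdev_open_hook, dev_open, bdev_module_load, and
so on) is hypothetical and not an existing FreeBSD interface, and the
choice of ENODEV as the "not supported" error is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical hook type: variant code run at open time for 'b' nodes. */
typedef int (*bdev_open_hook_t)(int unit);

/* Stays NULL until the (hypothetical) block-semantics module is loaded. */
static bdev_open_hook_t bdev_open_hook = NULL;

/* What the module's load and unload handlers would do. */
void bdev_module_load(bdev_open_hook_t hook)
{
    assert(hook != NULL);
    bdev_open_hook = hook;
}

void bdev_module_unload(void)
{
    bdev_open_hook = NULL;
}

/* Example hook a module might install; it only pretends to succeed. */
int example_enable_buffering(int unit)
{
    (void)unit;
    return 0;
}

/*
 * Open path: a single test on 'b'-ness.  Opens of 'c' nodes never look
 * at the hook, so they pay nothing beyond that one comparison.
 */
int dev_open(int unit, int is_block_node)
{
    if (!is_block_node)
        return 0;                   /* character device: unchanged path */
    if (bdev_open_hook == NULL)
        return ENODEV;              /* as if block devices are unsupported */
    return bdev_open_hook(unit);    /* variant code supplied by the module */
}
```

With no module loaded, dev_open(0, 1) fails with ENODEV; after
bdev_module_load(example_enable_buffering) the same call succeeds.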


[ ... in response to Julian ... ]

> > This is an issue, since not all data is 2048 byte blocks, but
> > can in fact be 2352, or a physical sector size of 2048, 2336, or
> > 2340 bytes.  This will only get more complicated as DVD and other
> > standards evolve and come online.
> 
> I'm not sure this would work anyhow and I think our buffering code may
> explode with non binary blocksizes. You also don't want to flush your vm
> buffers with some top-40 song.. :-)

That depends.  I might, if I were mastering a CD on a read/write
device before burning it.

Not to say that the FS that can do this currently exists, but
the change suggested would certainly preclude it ever existing.


> > I would argue that such database software is either broken, or
> > it is expecting a broken kernel (one which does not do the correct
> > thing on block device descriptors marked O_SYNC -- such as FreeBSD's
> > existing block device semantics).
> 
> I think this was the reason for John Dyson's async IO stuff.
> Does it work as expected on raw devices? I presume so.

My point was that the applications' use of the devices without the
concept in force was broken, not the concept itself: errors can be
reported correctly by using the correct open mode (O_SYNC), and by
fixing FreeBSD to honor that open mode.


> > > Terry argues for retaining the bdev semantics rather than the cdev
> > > semantics, but I think we can dismiss that idea based on the above
> > > observation: it would penalize software which knows better.  Retaining
> > > the bdev would in essence be emulating the mistake Linux made, and
> > > which they are now unmaking.
> > 
> > I think that for "software that knows better", i.e. software that
> > has called fstat(2) to get st_blksize, and intentionally performs
> > aligned writes, that it would be trivial to determine if a write
> > was on a block boundary, and spanned an integer number of blocks,
> > and therefore not penalize the smarter software.  This is really
> > an implementation issue, not a performance issue.
> 
> It doesn't help if what you are trying to do is use the buffer cache to
> cache.. especially if you are doing so, so that different processes can
> benefit from each other's cache filling activities.

The data will still be in buffer cache; it will just be hung
off the device vnode.
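
The aligned-write test from the quoted text above (the write starts on a
block boundary and spans a whole number of blocks) can be sketched as
follows.  The function name is hypothetical; blksize is assumed to be
the st_blksize value the caller obtained from fstat(2).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

/*
 * Returns true when a write of len bytes at offset starts on a block
 * boundary and covers an integer number of blocks, so that it could be
 * passed straight through without any read-modify-write buffering.
 */
bool write_is_block_aligned(off_t offset, size_t len, size_t blksize)
{
    assert(blksize > 0);
    if (len == 0)
        return false;               /* nothing to pass through */
    return (offset % (off_t)blksize) == 0 && (len % blksize) == 0;
}
```

A 32 byte write at offset 64 fails this test on a 512 byte block device;
a 1024 byte write at offset 1024 passes it.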


<TANGENT>

> > I believe the tools should be implemented via a different API,
> > since the kernel already knows about slices, partitions, etc.,
> > and has to have that knowledge embedded in it.  So either way,
> > the tools' promiscuous knowledge of stuff that they really have no
> > right knowing in the first place isn't an argument for getting
> > rid of block devices -- nor an argument in favor of keeping them.
> 
> That's a whole different question, and having tried to implement it I know
> it has its own pitfalls. You end up having to know some incestuous
> information no matter what you do.

Having implemented precisely this on SVR4, I don't see the pitfalls
to which you are referring.

A single ioctl() to ask about the available partitioning methods,
including the one preferred by the device, another to get available
space & size & allowable count, etc., and another to instantiate or
deinstantiate an instance based on a variable length array, based on
the actual count.

The idea of slicing things up into subregions, region contiguity
and overlap rules, etc., can all be made abstract via an API; it's
quite trivial.

It only gets complicated if you try and write the information on
raw partitions from user space, and under an API based scheme,
that's not an allowable operation.
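
As a rough illustration of how small the abstract rules can be, here is
a sketch of the check a kernel might run when asked to instantiate a
variable length array of regions.  All structure and function names are
hypothetical, and the specific rules (no empty regions, no overlap,
everything inside the device) are assumptions rather than the SVR4
implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One subregion of the device, in device blocks (hypothetical layout). */
struct part_region {
    uint64_t start;
    uint64_t len;
};

/* Hypothetical reply to the "space, size and allowable count" query. */
struct part_limits {
    uint64_t total_blocks;
    size_t   max_regions;
};

/* Returns 0 if the region array is acceptable, -1 otherwise. */
int part_validate(const struct part_limits *lim,
                  const struct part_region *r, size_t count)
{
    assert(lim != NULL);
    if (count > lim->max_regions)
        return -1;                          /* too many regions */
    if (count > 0 && r == NULL)
        return -1;
    for (size_t i = 0; i < count; i++) {
        /* Each region must be non-empty and lie inside the device. */
        if (r[i].len == 0 || r[i].start > lim->total_blocks ||
            r[i].len > lim->total_blocks - r[i].start)
            return -1;
        /* Assumed rule: no region may overlap an earlier one. */
        for (size_t j = 0; j < i; j++) {
            if (r[i].start < r[j].start + r[j].len &&
                r[j].start < r[i].start + r[i].len)
                return -1;
        }
    }
    return 0;
}
```

User space would only ever submit the array; the on-disk format stays
behind the interface.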

</TANGENT>


> > As Julian didn't point out, but probably meant to with his example,
> > Multiple fromas operating at a granularity of sizeof(struct foo),
> > where sizeof(struct foo) is not an integer multiple of the underlying
> > device block size, will have to have some form of promiscuous IPC
> > mechanism to communicate with each other.
> >
> 
> "fromas"? I assume that was a neural spasm attacking while the fingers 
> were being ordered to type "processes".

No, it was an editor escape character timeout glitch over a
slow link, and yes, it was supposed to be "processes".


> Well they wouldn't need it if they were working through a coherent
> cache system (and most of them were readers).

For files, this is true; for character devices, it is not.

Say I have a structure of 32 bytes in length, and I want to write
the third one in the file.  The unavailability of the block device
means that I must find out the block size, get a buffer of that
size, read in the first block, modify the data starting 64 bytes
into the buffer for 32 bytes, and then write out the whole block.

This means that even if I have a locking mechanism (e.g. Sybase's)
that lets me lock the third structure, someone may race me to
another structure in the same block, and then I will overwrite
their data with the previous contents.

User buffering for sub-block boundaries is unacceptable in a
multiprocess environment.
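
The lost update can be shown with a deterministic, single-threaded
sketch in which an in-memory array stands in for one device block and
two local buffers stand in for two processes.  The sizes (512 byte
block, 32 byte structure) and all names are illustrative assumptions.

```c
#include <assert.h>
#include <string.h>

#define BLKSIZE 512     /* assumed device block size */
#define RECSIZE 32      /* size of one structure, as in the text */

static unsigned char device[BLKSIZE];   /* stands in for one raw block */

/* A raw device only moves whole blocks. */
static void read_block(unsigned char *buf)
{
    memcpy(buf, device, BLKSIZE);
}

static void write_block(const unsigned char *buf)
{
    memcpy(device, buf, BLKSIZE);
}

/*
 * Interleaves two read-modify-write cycles on the same block.
 * Returns 1 if B's update to the fourth structure was lost.
 */
int lost_update_demo(void)
{
    unsigned char a[BLKSIZE], b[BLKSIZE];

    memset(device, 0, BLKSIZE);

    read_block(a);                          /* A wants the third structure */
    read_block(b);                          /* B wants the fourth structure */

    memset(b + 3 * RECSIZE, 'B', RECSIZE);  /* B modifies its copy */
    write_block(b);                         /* B's update reaches the device */

    memset(a + 2 * RECSIZE, 'A', RECSIZE);  /* A modifies 64 bytes in */
    write_block(a);                         /* A writes back stale bytes */

    /* A's structure is present, B's has reverted to its old contents. */
    return device[2 * RECSIZE] == 'A' && device[3 * RECSIZE] == 0;
}
```

Locking only the third structure does not help here, because A's write
covers the whole block, including bytes it never locked.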

Additionally, this seems to be counter-intuitive, if only from
first principles: I am supposed to be able to treat devices as
I can any other object in the filesystem for which I have read
or write permission, when I am reading or writing.

I believe there is also a separate requirement, stated in the
original design goals (and possibly in POSIX): block devices
are seekable.


[ ... an ioctl() to turn on block buffering ... ]

> > I would not object to removal of the block devices (except on
> > standards conformance based grounds), if it were guaranteed
> > that such an ioctl() would be implemented before their removal,
> > and that a user desiring to do so could override the "MAKEDEV"
> > to create "block" devices, on which this ioctl() call was implicitly
> > called on open.
> 
> We are adding more and more complexity back..

Not really.  The complexity argument is (supposedly) based on
the coherency argument, not the amount of code argument.  As I
stated before, you could mute the "amount of code" argument
simply by placing the implementation code into a kernel module.

As for complexity in bdevsw[] vs. cdevsw[], apart from the fact
that both are long overdue for pasture based on a devfs or similar
solution, it would be easy to declare that, on a system where
the module was loaded, access to a block device inode was the
same as access to a character device inode + the ioctl(), and
there is a 1:1 correspondence between block and character
devices in this world (much of the complexity argument appears
to depend on the desynchronization of these tables, and the
idea that both tables must, for some unknown reason, be somehow
maintained in any implementation).


> > I think a character device that allowed block semantics, but would
> > discard cache buffers if accessed on block boundaries would equally
> > suffice to address the issue of unification of the block and character
> > device namespace, which I think is the real issue here.
> 
> Unless you were looking for it to KEEP the information to speed up the
> next access.

The blocks *won't* be cached off the vnode for a character
device?  This flies in the face of reason... not to mention
file system performance.  I don't believe it.


> > However, an ioctl() based solution, with a compatibility mode which
> > is not enabled by default (but must be capable of being soft-enabled)
> > would suffice.
> 
> Once again. it would solve a subset of the needs.

With a compatibility mode, you would not know the difference, so
long as the module was loaded, and the devices created.


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.

