Date:      Thu, 14 Oct 1999 16:56:25 -0700 (PDT)
From:      Julian Elischer <julian@whistle.com>
To:        freebsd-arch@freebsd.org
Cc:        mckusick@mckusick.com
Subject:   Re: The eventual fate of BLOCK devices.
Message-ID:  <Pine.BSF.4.10.9910141621480.17468-100000@current1.whistle.com>
In-Reply-To: <199910142056.NAA29867@usr08.primenet.com>



On Thu, 14 Oct 1999, Terry Lambert wrote:

> First of all, thanks to everyone for such a focussed discussion.
> 
> I have some comments on Poul's comments, and would at least like
> to argue for a "legacy mode", even if it is not enabled by default,
> so long as it can be enabled (and implied) without a kernel recompile
> (but perhaps requiring a kernel module), for standards compliance
> reasons, if no other.

I have mentioned before, and I will mention again, that if a standard
disk layer is implemented, then it would be quite easy to implement
a block buffered interface to the raw disks, purely within the disk layer.
I could imagine simply taking the highest bit in the minor number to
indicate raw or blocked access. I could even happily see the next highest
bit being used to indicate write-through or write-back behaviour.
(There are 24 bits and we probably don't need them all.)

This would remove all bdevs from the system and still give us 'buffered'
disk devices if they were needed.
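
As a rough sketch of the minor-number idea: the macro and function names
below are hypothetical, and both the bit positions and the polarity (bit set
means buffered, bit set means write-back) are assumptions made for
illustration, not an existing FreeBSD interface.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout of a 24-bit minor number (illustrative only). */
#define MINOR_BITS       24u
#define MINOR_BUFFERED   (1u << (MINOR_BITS - 1u))  /* bit 23: buffered access */
#define MINOR_WRITEBACK  (1u << (MINOR_BITS - 2u))  /* bit 22: write-back */
#define MINOR_UNIT_MASK  (MINOR_WRITEBACK - 1u)     /* bits 0..21: unit bits */

/* Build a minor number from the unit bits and the two mode flags. */
static uint32_t minor_make(uint32_t unit, int buffered, int writeback)
{
    assert(unit <= MINOR_UNIT_MASK);
    return unit |
        (buffered ? MINOR_BUFFERED : 0u) |
        (writeback ? MINOR_WRITEBACK : 0u);
}

/* Nonzero if the minor number asks for buffered (block-style) access. */
static int minor_is_buffered(uint32_t minor)
{
    return (minor & MINOR_BUFFERED) != 0;
}

/* Nonzero if the minor number asks for write-back behaviour. */
static int minor_is_writeback(uint32_t minor)
{
    return (minor & MINOR_WRITEBACK) != 0;
}

/* Strip the two flag bits, leaving whatever identifies the disk itself. */
static uint32_t minor_unit(uint32_t minor)
{
    return minor & MINOR_UNIT_MASK;
}
```

With this layout the disk layer would look at the two flag bits on open and
pass the remaining 22 bits on unchanged.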

> 
> 
> 
> 
> I would add:
> 
>     4) Programs which want to treat object in the filespace as if
>        they were byte streams.

e.g. I have quite often used dd to extract the MBR table from the disk
with the command:

dd if=/dev/wd0 of=/tmp/table bs=1 skip=446 count=64

I could of course do this in 2 steps, first reading the block and then
extracting the bytes, but it's an example of something 'breaking'.
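
The two-step version might look roughly like this in C. It is only a sketch:
it assumes 512-byte sectors, and the function names are made up for the
example.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define SECTOR_SIZE   512  /* assumed sector size */
#define PTABLE_OFFSET 446  /* same offset as skip=446 in the dd command */
#define PTABLE_SIZE   64   /* same length as count=64 in the dd command */

/* Step 2: copy the table out of an in-memory copy of sector 0. */
static void extract_ptable(const unsigned char *sector, unsigned char *table)
{
    assert(PTABLE_OFFSET + PTABLE_SIZE <= SECTOR_SIZE);
    memcpy(table, sector + PTABLE_OFFSET, PTABLE_SIZE);
}

/*
 * Steps 1 and 2 together: one aligned, sector-sized read of sector 0,
 * then the extraction. Returns 0 on success, -1 on error or short read.
 */
static int read_ptable(int fd, unsigned char *table)
{
    unsigned char sector[SECTOR_SIZE];
    ssize_t n = pread(fd, sector, sizeof(sector), 0);

    if (n != (ssize_t)sizeof(sector))
        return -1;
    extract_ptable(sector, table);
    return 0;
}
```

Every program that wants 64 bytes from the middle of a sector would need its
own copy of something like this, which is what the one-line dd avoids today.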

> 
> In other words, it shouldn't matter, and I should not have to
> give special arguments to programs such as "tar" and "dd" and "team"
> to do I/O in variable media blocking factors.
> 
> I think I would also add:
> 
>     5) Programs that have to deal with CDROM's containing multiple
>        sessions.
> 
> This is an issue, since not all data is 2048 byte blocks, but
> can in fact be 2352, or a physical sector size of 2048, 2336, or
> 2340 bytes.  This will only get more complicated as DVD and other
> standards evolve and come online.


I'm not sure this would work anyhow, and I think our buffering code may
explode with non-binary blocksizes. You also don't want to flush your vm
buffers with some top-40 song.. :-)

> 
> In addition, many WORM, magneto-optical, and Japanese hard
> drives (such as those by default in the NEC PC-98) are 1024 bytes.
> 
> I would have that complexity hidden from the user, who is most
> interested in a linear array of bytes of arbitrary length, and
> in seeking to non-block aligned offsets in the linear array.

true, it would be nice..

> 
> 
> > Database software prefers cdev semantics if at all possible; if
> > running on anything but a cdev, database software calls fsync(2) a
> > lot to make sure the writes have hit the media.
> 
> I would argue that such database software is either broken, or
> it is expecting a broken kernel (one which does not do the correct
> thing on block device descriptors marked O_SYNC -- such as FreeBSD's
> existing block device semantics).

I think this was the reason for John Dyson's async IO stuff.
Does it work as expected on raw devices? I presume so.

> 
> 
> > Terry argues for retaining the bdev semantics rather than the cdev
> > semantics, but I think we can dismiss that idea based on the above
> > observation: it would penalize software which knows better.  Retaining
> > the bdev would in essence be emulating the mistake Linux made, and
> > which they are now unmaking.
> 
> I think that for "software that knows better", i.e. software that
> has called fstat(2) to get st_blksize, and intentionally performs
> aligned writes, that it would be trivial to determine if a write
> was on a block boundary, and spanned an integer number of blocks,
> and therefore not penalize the smarter software.  This is really
> an implementation issue, not a performance issue.

It doesn't help if what you are trying to do is use the buffer cache to
cache, especially if you are doing so in order that different processes can
benefit from each other's cache-filling activities.
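
For reference, the alignment test described in the quoted paragraph is cheap
to write. A minimal sketch follows; the function name is hypothetical, and
blksize is assumed to be the st_blksize value the caller got from fstat(2).
It only classifies a request and does nothing for the shared-cache case
above.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

/*
 * Nonzero if the request starts on a block boundary and covers a whole
 * number of blocks, i.e. it could bypass buffering without penalty.
 */
static int io_is_block_aligned(off_t offset, size_t length, size_t blksize)
{
    assert(blksize > 0);
    if (offset < 0 || length == 0)
        return 0;
    return (offset % (off_t)blksize) == 0 && (length % blksize) == 0;
}
```

The dd example earlier (offset 446, length 64) fails this test on a 512-byte
device, so it is exactly the kind of request that would need the buffered
path.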

> 
> 
> > The filesystem maintenance applications mentioned so far which rely
> > on bdev semantics, the EXT2FS tools, can be trivially converted to
> > operate on cdev semantics.  The majority of such tools already
> > correctly operate on cdevs.
> 
> I believe the tools should be implemented via a different API,
> since the kernel already knows about slices, partitions, etc.,
> and has to have that knowledge embedded in it.  So either way,
> the tools' promiscuous knowledge of stuff that they really have no
> right knowing in the first place isn't an argument for getting
> rid of block devices -- nor an argument in favor of keeping them.

That's a whole different question, and having tried to implement it I know
it has its own pitfalls. You end up having to know some incestuous
information no matter what you do.

> 
> 
> > Savecore(8) has already been converted to operate on cdevs.
> 
> Irrelevant, I think, as well.
> 
> Clearly, we could convert the entirety of all FTP'able software
> on the Internet to do its own block size determination, and
> do buffering in user space thereafter.  I think this would be
> wasteful.
> 
> As Julian didn't point out, but probably meant to with his example,
> Multiple fromas operating at a granularity of sizeof(struct foo),
> where sizeof(struct foo) is not an integer multiple of the underlying
> device block size, will have to have some form of promiscuous IPC
> mechanism to communicate with each other.
>

"fromas"? I assume that was a neural spasm attacking while the fingers 
were being ordered to type "processes".

Well, they wouldn't need it if they were working through a coherent cache
system (and most of them were readers).
 
> 
> Without buffering, a supra-record offset granularity would need
> to be maintained and communicated between multiple programs that
> are accessing the character device on non-block boundaries.
> 
> This is a can of worms.
> 
> 
> > Using mmap(2) to provide a new type of buffered semantics for
> > disk-like devices is interesting, but its applicability will be
> > limited by the virtual address space of a process: you can't map
> > a 20GB database into a 32bit address space, so a lot of mmap(2)
> > calls will be needed for serious sized data.  The need for, and
> > actual use of such a facility seems uncertain.
> 
> Agreed.

Also, mmap has disadvantages which plain old read/write get around:
you need to 'touch' an mmapped page to actually initiate the transfer, which
may be a slightly more complicated operation than one expects.
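
To illustrate the 'touch' point, here is a small sketch that maps a file
read-only and reads one byte from each page so that the pages are actually
brought in. The function names are hypothetical, and the code assumes the
object can be mapped from offset 0 for the requested length.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map len bytes of fd and read one byte per page. Mapping alone moves no
 * data; each read below is what triggers the fault and the transfer.
 * Returns the number of pages touched, or -1 on error.
 */
static long touch_mapped_pages(int fd, size_t len)
{
    long pagesize = sysconf(_SC_PAGESIZE);
    volatile unsigned char sink = 0;
    unsigned char *p;
    long touched = 0;

    assert(pagesize > 0);
    if (len == 0)
        return -1;
    p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED)
        return -1;
    for (size_t off = 0; off < len; off += (size_t)pagesize) {
        sink ^= p[off];  /* the 'touch' */
        touched++;
    }
    (void)sink;
    munmap(p, len);
    return touched;
}

/* Convenience wrapper: open path read-only, touch its pages, close it. */
static long touch_file_pages(const char *path, size_t len)
{
    int fd = open(path, O_RDONLY);
    long n;

    if (fd < 0)
        return -1;
    n = touch_mapped_pages(fd, len);
    close(fd);
    return n;
}
```

Note that an I/O error during the touch arrives as a signal rather than as a
return value from read(2), which is part of why this is more complicated
than it first looks.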



> 
> 
> > There is general disagreement about how much code we save, but
> > nobody disputes that we will be able to remove some amount of
> > complexity from the kernel.  Most people seem to overlook the
> > needlessly replicated code in a number of xxx(8) tools to DTRT with
> > /dev/foo vs /dev/rfoo.
> 
> I think if these tools are written to operate on the less limited
> block device, they should simply refuse to operate on the more
> limited character device.  This is an elegant solution, and some
> message morally equivalent to "use the block device, dummy" would
> be adequate to get the user to do the right thing, rather than
> making up for the inadequacies of the user. (down that road lies
> ruin and "undelete" and "unnewfs").

I doubt that we would save much (already written) code from the
tools. Some, but not an amount that would influence this decision.


> 
> 
> > Implementing an ioctl(2) to switch a disk-like device into bdev
> > mode is relatively trivial, but there currently seems to be no
> > point in doing so.

That's no real good from my perspective, as you need to add extra code to
do the ioctl.. where did the 'simplification of code' go?

> 
> I think the point in doing this would be to ensure that code
> would not be broken by the OS, and could be forced to work.

Not without modification.

> 
> I would not object to removal of the block devices (except on
> standards conformance based grounds), if it were guaranteed
> that such an ioctl() would be implemented before their removal,
> and that a user desiring to do so could override the "MAKEDEV"
> to create "block" devices, on which this ioctl() call was implicitly
> called on open.

We are adding more and more complexity back..

> 
> This would certainly satisfy the "legacy/standards crowd", I
> think, while still allowing the surgery you want to perform.
> 
> 
> > There is a significant majority supporting the removal of bdev
> > semantics.

Actually I see it as "A few determined people (I count 3) for it, a few (I
also count 3) against it, and a whole pile of people who, judging by their
total lack of activity, couldn't care less".

> Majority is not a measure of technical merit.

Win95?


> 
> I think a character device that allowed block semantics, but would
> discard cache buffers if accessed on block boundaries would equally
> suffice to address the issue of unification of the block and character
> device namespace, which I think is the real issue here.

Unless you were looking for it to KEEP the information to speed up the
next access.

> 
> However, an ioctl() based solution, with a compatibility mode which
> is not enabled by default (but must be capable of being soft-enabled)
> would suffice.

Once again, it would solve only a subset of the needs.

> 
> 
> > An ioctl(2) based mode-switch will only be implemented if a
> > very good reason for doing so materializes.
> 
> I think that the fact that we can't know about all software, and
> that the standards specify block devices, argues for some form
> of legacy support mechanism, even if it isn't enabled by default
> for FreeBSD systems.
> 
> 
> 					Terry Lambert
> 					terry@lambert.org
> ---
> Any opinions in this posting are my own and not those of my present
> or previous employers.
> 
> 
> 
> 
> To Unsubscribe: send mail to majordomo@FreeBSD.org
> with "unsubscribe freebsd-arch" in the body of the message
> 








