From: Terry Lambert
Message-Id: <199910150121.SAA20401@usr01.primenet.com>
Subject: Re: The eventual fate of BLOCK devices.
To: julian@whistle.com (Julian Elischer)
Date: Fri, 15 Oct 1999 01:21:04 +0000 (GMT)
Cc: freebsd-arch@freebsd.org, mckusick@mckusick.com
In-Reply-To: from "Julian Elischer" at Oct 14, 99 04:56:25 pm

Here's an argument I haven't heard before, and then a response to
Julian:

Question 1: How will I netboot my non-FreeBSD OS that requires block
devices using an NFS mounted / containing that OS's /dev, if FreeBSD
can not support block devices in its FS?

Question 2: If block device nodes are still allowed in the FS on a
FreeBSD box, what's the point of not allowing variant behaviour based
on an implied ioctl in the open routine for 'b' vs. 'c' nodes?

Question 3: If a single "if" test based on the 'b'-ness of a device is
allowed at open time (no real performance penalty for operations
against already open 'c' devices), why should I not be able to call
variant code (e.g. ioctl)?

Question 4: If the argument is simply against the variant code being
in the kernel by default, then why not permit it in a kernel module,
which need not be loaded by default (or could be loaded by default,
except on the systems of people who hate block devices)?

It seems logical to me to allow legacy systems to netboot, and
therefore to allow block device nodes to be created.  Since it can't
hurt to have code variant on 'b'-ness if 'b'-ness is never used by
people who don't like it, the open routine can check whether a
function pointer is null and, if it is, fail as if block devices are
not supported; a kernel module sets the pointer non-null when it is
loaded.

[ ... in response to Julian ... ]

> > This is an issue, since not all data is 2048 byte blocks, but
> > can in fact be 2352, or a physical sector size of 2048, 2336, or
> > 2340 bytes.  This will only get more complicated as DVD and other
> > standards evolve and come online.
>
> I'm not sure this would work anyhow, and I think our buffering code may
> explode with non binary blocksizes.  You also don't want to flush your vm
> buffers with some top-40 song.. :-)

That depends.  I might, if I were mastering a CD on a read/write
device before burning it.  Not to say that the FS that can do this
currently exists, but the change suggested would certainly preclude
it ever existing.

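[ A minimal userland sketch of the null-function-pointer arrangement
described before the reply to Julian.  Every name here
(bdev_open_hook, bdev_compat_load, dev_open, and so on) is
hypothetical and is not an existing FreeBSD interface, and the choice
of ENODEV as the "not supported" error is an assumption. ]

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical hook type: invoked when a 'b' node is opened. */
typedef int (*bdev_open_hook_t)(int unit);

/* NULL until the (hypothetical) compatibility module is loaded. */
static bdev_open_hook_t bdev_open_hook = NULL;

/*
 * Stand-in for the module's variant code.  In a real kernel this is
 * where the implied "turn on block buffering" ioctl would be applied.
 */
static int
bdev_compat_open(int unit)
{
	(void)unit;
	return (0);
}

/* Module load and unload: set or clear the pointer. */
void
bdev_compat_load(void)
{
	bdev_open_hook = bdev_compat_open;
}

void
bdev_compat_unload(void)
{
	bdev_open_hook = NULL;
}

/*
 * Open path.  is_block is the 'b'-ness of the node being opened.
 * Returns 0 on success or an errno value on failure.
 */
int
dev_open(int unit, int is_block)
{
	if (!is_block)
		return (0);		/* 'c' node: path unchanged */
	if (bdev_open_hook == NULL)
		return (ENODEV);	/* behave as if unsupported */
	return (bdev_open_hook(unit));
}
```

[ With the module absent, opening a 'b' node fails and 'c' nodes pay
for one test at open time only; after bdev_compat_load() the same
open succeeds through the hook. ]
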
> > I would argue that such database software is either broken, or
> > it is expecting a broken kernel (one which does not do the correct
> > thing on block device descriptors marked O_SYNC -- such as FreeBSD's
> > existing block device semantics).
>
> I think this was the reason for John Dyson's async IO stuff.
> Does it work as expected on raw devices?

I presume so.  My point was that it can be made to report errors
correctly by using the correct open mode, and by fixing FreeBSD to
honor that open mode; not that the concept was broken, but that the
application's use of the devices without the concept in force was
broken.

> > > Terry argues for retaining the bdev semantics rather than the cdev
> > > semantics, but I think we can dismiss that idea based on the above
> > > observation: it would penalize software which knows better.  Retaining
> > > the bdev would in essence be emulating the mistake Linux made, and
> > > which they are now unmaking.
> >
> > I think that for "software that knows better", i.e. software that
> > has called fstat(2) to get st_blksize, and intentionally performs
> > aligned writes, that it would be trivial to determine if a write
> > was on a block boundary, and spanned an integer number of blocks,
> > and therefore not penalize the smarter software.  This is really
> > an implementation issue, not a performance issue.
>
> It doesn't help if what you are trying to do is use the buffer cache to
> cache.. especially if you are doing so, so that different processes can
> benefit from each other's cache filling activities.

The data will still be in buffer cache; it will just be hung off
the device vnode.

> > I believe the tools should be implemented via a different API,
> > since the kernel already knows about slices, partitions, etc.,
> > and has to have that knowledge embedded in it.
> > So either way, the tools' promiscuous knowledge of stuff that they
> > really have no right knowing in the first place isn't an argument
> > for getting rid of block devices -- nor an argument in favor of
> > keeping them.
>
> That's a whole different question, and having tried to implement it I know
> it has its own pitfalls.  You end up having to know some incestuous
> information no matter what you do.

Having implemented precisely this on SVR4, I don't see the pitfalls
to which you are referring.  A single ioctl() to ask about the
available partitioning methods, including the one preferred by the
device; another to get available space, size, allowable count, etc.;
and another to instantiate or deinstantiate an instance based on a
variable length array, based on the actual count.

The idea of slicing things up into subregions, region contiguity and
overlap rules, etc., can all be made abstract via an API; it's quite
trivial.

It only gets complicated if you try to write the information on raw
partitions from user space, and under an API based scheme, that's not
an allowable operation.

> > As Julian didn't point out, but probably meant to with his example,
> > multiple fromas operating at a granularity of sizeof(struct foo),
> > where sizeof(struct foo) is not an integer multiple of the underlying
> > device block size, will have to have some form of promiscuous IPC
> > mechanism to communicate with each other.
>
> "fromas"?  I assume that was a neural spasm attacking while the fingers
> were being ordered to type "processes".

No, it was an editor escape character timeout glitch over a slow
link, and yes, it was supposed to be "processes".

> Well they wouldn't need it if they were working through a coherent
> cache system (and most of them were readers).

For files, this is true; for character devices, this is not true.
Say I have a structure of 32 bytes in length, and I want to write
the third one in the file.
The unavailability of the block device means that I must find out
the block size, get a buffer of that size, read in the first block,
modify the data starting 64 bytes into the buffer for 32 bytes, and
then write out the whole block.

This means that even if I have a locking mechanism (e.g. Sybase's)
that lets me lock the third block, someone may race me to another
structure in the block, and then I will overwrite their data with
the previous contents.

User buffering for sub-block boundaries is unacceptable in a
multiprocess environment.

Additionally, this seems to be counter-intuitive, if only from first
principles: I am supposed to be able to treat devices as I can any
other object in the filesystem for which I have read or write
permission, when I am reading or writing.

I believe there is a separate limitation, as well, which was stated
in the original design goals (and may be in POSIX), which is to say
that block devices are seekable.

[ ... an ioctl() to turn on block buffering ... ]

> > I would not object to removal of the block devices (except on
> > standards conformance based grounds), if it were guaranteed
> > that such an ioctl() would be implemented before their removal,
> > and that a user desiring to do so could override the "MAKEDEV"
> > to create "block" devices, on which this ioctl() call was implicitly
> > called on open.
>
> We are adding more and more complexity back..

Not really.  The complexity argument is (supposedly) based on the
coherency argument, not the amount of code argument.  As I stated
before, you could mute the "amount of code" argument simply by
placing the implementation code into a kernel module.

As for complexity in bdevsw[] vs.
cdevsw[], apart from the fact that both are long due for pasture
based on a devfs or similar solution, it would be easy to declare
that, on a system where the module was loaded, access to a block
device inode was the same as access to a character device inode +
the ioctl(), and there is a 1:1 correspondence between block and
character devices in this world (much of the complexity argument
appears to depend on the desynchronization of these tables, and the
idea that both tables must, for some unknown reason, be somehow
maintained in any implementation).

> > I think a character device that allowed block semantics, but would
> > discard cache buffers if accessed on block boundaries would equally
> > suffice to address the issue of unification of the block and character
> > device namespace, which I think is the real issue here.
>
> Unless you were looking for it to KEEP the information to speed up the
> next access.

The blocks *won't* be cached off the vnode for a character device?
This flies in the face of reason... not to mention file system
performance.  I don't believe it.

> > However, an ioctl() based solution, with a compatibility mode which
> > is not enabled by default (but must be capable of being soft-enabled)
> > would suffice.
>
> Once again. it would solve a subset of the needs.

With a compatibility mode, you would not know the difference, so
long as the module was loaded, and the devices created.


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.