Date: Mon, 11 Oct 1999 20:51:27 +0000 (GMT)
From: Terry Lambert <tlambert@primenet.com>
To: peter.jeremy@alcatel.com.au
Cc: freebsd-arch@freebsd.org
Subject: Re: The eventual fate of BLOCK devices.
Message-ID: <199910112051.NAA03111@usr09.primenet.com>
In-Reply-To: <99Oct11.124046est.40329@border.alcanet.com.au> from "Peter Jeremy" at Oct 11, 99 12:44:30 pm

> > There are several trivial ways to solve the write-error problem.
>
> Unless I've missed something along the way, it's not that simple.
> Traditionally, a write to a block device sits in the buffer cache
> until the sync daemon flushes it to disk.  I thought that unifying VM
> and the buffer cache, together with softupdates, would have relatively
> little impact on this - there's still a write-back cache with the
> cache flushing occurring asynchronously and independent of the writer.

Not to butt in, but O_FSYNC and IO_SYNC should handle this, and
if they don't they need to be fixed, rather than glossed over by
sweeping block devices under the rug.

More generally, it'd be nice to adhere to standards, if only
because of ABI compatibility concerns for "emulation" modules.

Standard UNIX (Single UNIX Specification) defines open(2)
arguments:

O_DSYNC   Write I/O operations on the descriptor complete as
          defined by synchronized I/O data integrity completion.

O_RSYNC   Read I/O operations on the descriptor complete at the
          same level of integrity as specified by the O_DSYNC
          and O_SYNC flags.  If both O_DSYNC and O_RSYNC are set
          in _oflag_, all I/O operations on the descriptor
          complete as defined by synchronized I/O data integrity
          completion.  If both O_SYNC and O_RSYNC are set in
          _oflag_, all I/O operations on the descriptor complete
          as defined by synchronized I/O file integrity
          completion.

O_SYNC    Write I/O operations on the descriptor complete as
          defined by synchronized I/O file integrity completion.

Note that the suggested ioctl(2) mechanism is inherently bogus,
since the specification states that ioctl(2) cannot operate on
files.

It would be nice to have a non-device-centric solution for this
problem, instead of inventing new things (concerning this, I
don't believe the idea that the future BSD installed base will
be coming from a Linux background in any way justifies the
repetition of the mistakes Linux has made).

> > First,
> > implement writes as write-through so a synchronous error can be returned.
>
> I would have thought that switching from the current write-back to a
> write-through policy would, in general, entail a significant
> performance hit.  Even if filesystem I/O is excluded (which I believe
> is the intent), you still lose the I/O clustering benefits.

You are correct.  However, I believe that it is the only way to
do a correct implementation of the control flags.

My take on this is that, when neither O_DSYNC nor O_SYNC is
specified in the open flags, the policy should be write-back,
and that it should require a user override (by specifying one
of these flags at the time of the open(2) call) to invoke the
performance hit (and the synchronous error reporting that
appears to be desired by some parties).

> > Second, implement an error code on close.
>
> This would seem to be preferable (and even POLA for direct I/O to
> block devices).  The problem I can see is firstly, the delays in
> syncer and secondly, getting I/O errors from syncer to the process (or
> processes, since several different processes could have written to the
> block(s) in question) issuing the close.

I believe the biggest issue here is actually "last close" vs.
"any close" hooks into the device drivers.  Both PHK and Bruce
have noted this issue in the past.  Perhaps it is time to deal
with it, rather than talking about it.

So far as assigning responsibility for the error, an error on
close is only an issue in the non-O_DSYNC, non-O_SYNC case, i.e.
when the user has elected ambiguity over the performance hit
that they would otherwise suffer.

I think this can be ameliorated by implementing an error code on
fsync(2), rather than one on close(2).  It could also be done in
addition to the one on close(2); however, it strikes me that the
error on close is much less useful in general, since it is very
hard to take corrective action by the time the user space
program is calling close.

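A minimal sketch of the two policies described above, as seen from
user space.  It is an illustration only, not code from any FreeBSD
tree: write_checked() and its want_sync parameter are made-up names,
and it assumes a system whose <fcntl.h> defines the POSIX O_SYNC flag
(this thread otherwise uses the BSD spelling, O_FSYNC).

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: write len bytes from buf to path.
 *
 * want_sync != 0: open with O_SYNC, so the caller opts in to the
 *                 performance hit and each write(2) can report its
 *                 own error.
 * want_sync == 0: default write-back; ask for deferred errors with
 *                 fsync(2) while the descriptor is still open.
 *
 * Returns 0 on success, or the errno value of the first failure.
 */
int
write_checked(const char *path, const void *buf, size_t len, int want_sync)
{
	int flags = O_WRONLY | O_CREAT | O_TRUNC;
	const char *p = buf;
	int error = 0;
	int fd;

	assert(path != NULL && buf != NULL);

	if (want_sync)
		flags |= O_SYNC;

	fd = open(path, flags, 0600);
	if (fd == -1)
		return (errno);

	while (len > 0) {
		ssize_t n = write(fd, p, len);

		if (n == -1) {
			if (errno == EINTR)
				continue;	/* interrupted, retry */
			error = errno;
			break;
		}
		p += n;
		len -= (size_t)n;
	}

	/*
	 * Write-back case: check fsync(2) before close(2), while the
	 * program still holds the descriptor and could react.
	 */
	if (error == 0 && !want_sync && fsync(fd) == -1)
		error = errno;

	/* A close(2) error is still recorded, but only as a last resort. */
	if (close(fd) == -1 && error == 0)
		error = errno;

	return (error);
}
```

With want_sync set, a failing write is expected to be reported by the
write(2) call that caused it; without it, the first report may only
arrive from fsync(2), which is still early enough for the program to
retry or to tell the user.
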
> > There are also situations where
> > errors can be assumed to not occur, such as when using buffered VN
> > partition which is backed by a file or swap.
>
> The device underlying the filesystem or swap could still suffer
> errors.  At some point, a decision needs to be made that the error is
> not reported back to the caller, but notified to `the operator' as
> a `system' error.

In particular, the reason it is difficult for the soft updates
code to enable or disable soft updates on the fly is that there
are buffers off the device vnode, and there are buffers off the
vnodes of the FS mounted on the device.

This two-stage caching has three negative effects:

(1) hard errors are not signalled until the underlying device
    vnode flush fails; a write-through to the device buffers
    will never fail, and therefore never signal a user process,

(2) uncommitted writes are, and remain, ambiguous, at least
    until the underlying vnode commit, and

(3) there is an implicit limitation on the size of a device
    that can be accommodated through the interface, equal to
    the normal limit on the size of a file within a filesystem.

I think that this is a more general problem, and is similar in
scope and effect to the VM object coherency issues underlying
VFS stacking.

I think that a more general solution should be sought, with this
in mind; the problem is going to have to be resolved eventually,
one way or the other, so ducking it now by eliminating block
devices doesn't save any work in the long run: better to not
procrastinate.


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.