Date:      Mon, 11 Oct 1999 20:51:27 +0000 (GMT)
From:      Terry Lambert <tlambert@primenet.com>
To:        peter.jeremy@alcatel.com.au
Cc:        freebsd-arch@freebsd.org
Subject:   Re: The eventual fate of BLOCK devices.
Message-ID:  <199910112051.NAA03111@usr09.primenet.com>
In-Reply-To: <99Oct11.124046est.40329@border.alcanet.com.au> from "Peter Jeremy" at Oct 11, 99 12:44:30 pm

> >    There are several trivial ways to solve the write-error problem.
> 
> Unless I've missed something along the way, it's not that simple.
> Traditionally, a write to a block device sits in the buffer cache
> until the sync daemon flushes it to disk.  I thought that unifying VM
> and the buffer cache, together with softupdates, would have relatively
> little impact on this - there's still a write-back cache with the
> cache flushing occurring asynchronously and independent of the writer.

Not to butt in, but O_FSYNC and IO_SYNC should handle this, and
if they don't they need to be fixed, rather than glossed over
by sweeping block devices under the rug.

More generally, it'd be nice to adhere to standards, if only because
of ABI compatibility concerns for "emulation" modules.

Standard UNIX (Single UNIX Specification) defines open(2) arguments:

	O_DSYNC		Write I/O operations on the descriptor
			complete as defined by synchronized I/O data
			integrity completion.

	O_RSYNC		Read I/O operations on the descriptor
			complete at the same level of integrity
			as specified by the O_DSYNC and O_SYNC flags.
			If both O_DSYNC and O_RSYNC are set in _oflag_
			all I/O operations on the descriptor complete
			as defined by synchronized I/O data integrity
			completion.  If both O_SYNC and O_RSYNC are
			set in _oflag_, all I/O operations on the
			descriptor complete as defined by synchronized
			I/O file integrity completion.

	O_SYNC		Write I/O operations on the descriptor
			complete as defined by synchronized I/O file
			integrity completion.
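
For concreteness, here is a rough userland sketch of asking for these
semantics at open(2) time.  The helper names open_sync() and
write_all() are mine, purely for illustration; the #ifdef is there
because not every system defines O_DSYNC, in which case the sketch
falls back to the stronger O_SYNC.  It assumes a system that spells
the flag O_SYNC rather than O_FSYNC.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: open 'path' for writing with synchronized I/O.
 * data_only != 0 requests data integrity completion (O_DSYNC) where
 * the system defines it; otherwise file integrity completion (O_SYNC).
 */
int
open_sync(const char *path, int data_only)
{
	int flags = O_WRONLY | O_CREAT;

#ifdef O_DSYNC
	flags |= data_only ? O_DSYNC : O_SYNC;
#else
	(void)data_only;	/* no O_DSYNC here: use O_SYNC for both cases */
	flags |= O_SYNC;
#endif
	return (open(path, flags, 0600));
}

/*
 * Hypothetical helper: write the whole buffer, retrying on EINTR.
 * Returns 0 on success, -1 with errno set on failure.  On a descriptor
 * opened with open_sync(), a failure here is the synchronous error
 * report under discussion.
 */
int
write_all(int fd, const void *buf, size_t len)
{
	const char *p = buf;

	while (len > 0) {
		ssize_t n = write(fd, p, len);

		if (n == -1) {
			if (errno == EINTR)
				continue;
			return (-1);
		}
		p += n;
		len -= (size_t)n;
	}
	return (0);
}
```

A caller would do fd = open_sync(path, 1) and treat any -1 from
write_all() as a hard error against that specific write, which is
exactly the attribution a write-back cache cannot give you.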

Note that the suggested ioctl(2) mechanism is inherently bogus,
since the specification states that ioctl(2) cannot operate on
files.  It would be nice to have a non-device-centric solution
for this problem, instead of inventing new things (concerning
this, I don't believe the idea that the future BSD installed base
will be coming from a Linux background in any way justifies the
repetition of the mistakes Linux has made).


> >  First, 
> >  implement writes as write-through so a synchronous error can be returned.
> 
> I would have thought that switching from the current write-back to a
> write-through policy would, in general, entail a significant
> performance hit.  Even if filesystem I/O is excluded (which I believe
> is the intent), you still lose the I/O clustering benefits.

You are correct.  However, I believe that it is the only way to do
a correct implementation of the control flags.

My take on this is that, in the absence of O_DSYNC and O_SYNC in
the open flags, the policy should be write-back, and that it should
require a user override (specifying one of these flags at the time
of the open(2) call) to invoke the performance hit (and the
synchronous error reporting that appears to be desired by some
parties).


> >    Second, implement an error code on close.
> 
> This would seem to be preferable (and even POLA for direct I/O to
> block devices).  The problem I can see is firstly, the delays in
> syncer and secondly, getting I/O errors from syncer to the process (or
> processes, since several different processes could have written to the
> block(s) in question) issuing the close.

I believe the biggest issue here is actually "last close" vs. "any
close" hooks into the device drivers.  Both PHK and Bruce have
noted this issue in the past.  Perhaps it is time to deal with it,
rather than talking about it.

So far as assigning responsibility for the error, an error on close
is only an issue in the non-O_DSYNC, non-O_SYNC case, i.e. when the
user has elected ambiguity in the face of the performance hit that
they would otherwise suffer.

I think this can be ameliorated by implementing an error code on
fsync(2), rather than one on close(2) (or in addition to; however
it strikes me that the error on close is much less useful, in
general, since it is very hard to take corrective action by the
time you are calling close in the user space program).


> > There are also situations where
> >    errors can be assumed to not occur, such as when using buffered VN
> >    partition which is backed by a file or swap.
> 
> The device underlying the filesystem or swap could still suffer
> errors.  At some point, a decision needs to be made that the error is
> not reported back to the caller, but notified to `the operator' as
> a `system' error.

In particular, the reason it is difficult for the soft updates code
to enable or disable soft updates on the fly is that there are
buffers off the device vnode, and there are buffers off the vnodes
of the FS mounted on the device.

This two-stage caching has three negative effects:  (1) hard errors
are not signalled until the underlying device vnode flush fails; a
writethrough to the device buffers will never fail, and therefore
never signal a user process, (2) uncommitted writes are, and
remain, ambiguous, at least until the underlying vnode commit, and
(3) there is an implicit limitation of the size of a device that
can be accommodated through the interface, equal to the normal limit
on the size of a file within a filesystem.
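
Effects (1) and (2) can be shown with a toy model.  Everything below
(toy_dev, toy_cache, toy_write(), toy_flush(), the block sizes) is
invented for illustration and has nothing to do with the real buffer
cache code; it only mimics the shape of the problem, where the write
path fills a cache and the device is first touched at flush time.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_BLKSZ	16
#define TOY_NBLK	4

/* Toy "device": either works or fails every transfer. */
struct toy_dev {
	bool failing;
	char media[TOY_NBLK][TOY_BLKSZ];
};

/* Toy cache sitting in front of the device. */
struct toy_cache {
	struct toy_dev *dev;
	char data[TOY_NBLK][TOY_BLKSZ];
	bool dirty[TOY_NBLK];
};

/*
 * Stage one: copy the block into the cache and mark it dirty.
 * The device is never consulted, so a hard error cannot be
 * reported to the writer from here.
 */
int
toy_write(struct toy_cache *c, size_t blk, const char *src)
{
	if (blk >= TOY_NBLK)
		return (EINVAL);
	memcpy(c->data[blk], src, TOY_BLKSZ);
	c->dirty[blk] = true;
	return (0);
}

/*
 * Stage two: push dirty blocks to the device.  This is the first
 * point at which a device failure becomes visible.  A block that
 * fails stays dirty, i.e. its on-media contents remain ambiguous.
 */
int
toy_flush(struct toy_cache *c)
{
	for (size_t i = 0; i < TOY_NBLK; i++) {
		if (!c->dirty[i])
			continue;
		if (c->dev->failing)
			return (EIO);
		memcpy(c->dev->media[i], c->data[i], TOY_BLKSZ);
		c->dirty[i] = false;
	}
	return (0);
}
```

With dev.failing set, toy_write() still returns 0 and only
toy_flush() returns EIO, by which time the writer that caused the
dirty block may be long gone.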

I think that this is a more general problem, and is similar in
scope and effect to the VM object coherency issues underlying VFS
stacking.  I think that a more general solution should be sought,
with this in mind; the problem is going to have to be resolved
eventually, one way or the other, so ducking it now by eliminating
block devices doesn't save any work in the long run: better to
not procrastinate.


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.



