Skip site navigation (1)Skip section navigation (2)
Date:      Thu, 2 Aug 2012 13:02:40 +0300
From:      Konstantin Belousov <kostikbel@gmail.com>
To:        Bruce Evans <brde@optusnet.com.au>
Cc:        arch@freebsd.org, David Xu <davidxu@freebsd.org>
Subject:   Re: short read/write and error code
Message-ID:  <20120802100240.GV2676@deviant.kiev.zoral.com.ua>
In-Reply-To: <20120802040542.G2978@besplex.bde.org>
References:  <5018992C.8000207@freebsd.org> <20120801071934.GJ2676@deviant.kiev.zoral.com.ua> <20120801183240.K1291@besplex.bde.org> <20120801162836.GO2676@deviant.kiev.zoral.com.ua> <20120802040542.G2978@besplex.bde.org>

next in thread | previous in thread | raw e-mail | index | archive | help

--3j1rCUNxv/n/LAQ6
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Thu, Aug 02, 2012 at 04:58:49AM +1000, Bruce Evans wrote:
> On Wed, 1 Aug 2012, Konstantin Belousov wrote:
>=20
> >On Wed, Aug 01, 2012 at 07:23:09PM +1000, Bruce Evans wrote:
> >>On Wed, 1 Aug 2012, Konstantin Belousov wrote:
> >>
> >>>On Wed, Aug 01, 2012 at 10:49:16AM +0800, David Xu wrote:
> >...
> >>The usability is specified for signals.  From an old POSIX draft:
> >>
> >>% 51235              If write( ) is interrupted by a signal before it
> >>writes any data, it shall return -1 with errno set to
> >>% 51236              [EINTR].
> >>% 51237              If write( ) is interrupted by a signal after it
> >>successfully writes some data, it shall return the
> >>% 51238              number of bytes written.
> >This is exactly what existing code does.
>=20
> Yes, this is one of the few cases that is completely specified (when
> complications for SA_RESTART are added).
>=20
> >>POSIX formally defines "Successfully Transferred", mainly for aio.  I
> >>couldn't find any formal definition of "successfully writes", but clear=
ly
> >>it is nonsense for a write to be unsuccessful if a reader on the local
> >>system or on an external system has successfully read some of the data
> >>written by the write.
> >>
> >>FreeBSD does try to convert EINTR to 0 after some data has been written,
> >>to conform to the above.  SIGPIPE should return EINTR to be returned to
> >>dofilewrite(), so there should be no problem for SIGPIPE.  But we were
> >>reminded of this old FreeBSD bug by probelms with SIGPIPE.
> >Sorry, I do not understand this, esp. second sentence.
>=20
> Oops, the second sentence has bad grammar.  I meant "generation of
> SIGPIPE should cause EINTR to be returned to dofilewrite(), so...".
> What actually happens is that lower-level code returns EPIPE, without
> generating EINTR, to dofilewrite() (except socket code generates
> SIGPIPE directly unless this is disabled by a socket option).  Then
> dofilewrite() first fails to see the EINTR because the return code
> is EPIPE (this happens even for the socket case).  Then dofilewrite()
> generates SIGPIPE, again without changing the error code to EINTR or
> looping back to pick up the special EINTR handling.  This is actually
> correct -- the error is specified to be EPIPE, not EINTR.  (This is
> similar to what happens for the internally-generated SIGXFSZ -- there
> is a not-so-special error code for that.)
Um, no, I do not see how EINTR could be referenced in this context at all.
SUSv4 says quite unambigous that "[EPIPE] An attempt is made to
write to a pipe or FIFO that is not open for reading by any process, or
that only has one end open. A SIGPIPE signal shall also be sent to the
thread."

Indeed, POSIX does not specify that we cannot return non-zero (success
write) and deliver SIGPIPE, but this is just insane. If SIGPIPE handler
returns, the return value from write(2) must be -1 and errno set to
EPIPE. For naive programs, which are not aware that stdout can be
pipe (i.e. the original target audience of SIGPIPE) not delivering the
signal on write(2) which write anything is just fine, since we can
make a valid assumption that they would repeat the write, and then
get EPIPE/SIGPIPE as intended.

Anyway, as I said, I very much dislike making the generic I/O layer
decide after the filesystem code, and limiting its ability to report
errors.

I believe that all what is needed is the patch below, which just fixes
a bug in pipe code (obvious after all the discussions). I inspected the
pipe_read() and it seems to behave correctly without any change.
>=20
> So the above is not completely specified after all.  EINTR is not
> intended to apply when the signal is internally generated (and you
> could argue that the write is not interrupted by a signal in that
> case; instead, the write fails and generates the signal, which is how
> this is implemented).  Then since it wasn't interrupted, the clause
> about returning the amount successfully written doesn't apply either.
> And since we are failing with error code EPIPE, we aren't returning
> the amount successfully written.  So it is apparently up to the
> lower-level pipe code to not generate EPIPE if it has written
> something sucessfully.  The wording for most of the specication of
> EPIPE suggests that this error is for when a there is no reader at the
> time the write() starts.
Exactly, this is what I mentioned to David as an alternative approach for
the fix, which does not cause screw-up of the generic I/O layer. I just
went ahead and implemented the fix, see the patch at the end of message.
It was tested with David program.

>=20
> >>>Then fix the pipe code, and not introduce the behaviour change for all
> >>>file types ?
> >>
> >>Because returning the error to userland breaks all file types that
> >>want to return a short i/o (mainly special files whose i/o cannot be
> >>backed out of).  They are just detecting and returning an error as a
> >>courtesy to upper layers, and to simplify the implementation.  The
> >>syscall API doesn't permit returning both the error code (the reason
> >>for the short i/o) and the short count, so the error code must be
> >>cleared to allow the short count to be returned.
> >No, there is the only sane behaviour for the fo_read and fo_write, to
> >return either no error (or interruption error) and adjust resid, or
> >return error. Returning both error and adjusting resid is just wrong.
>=20
> It is the FreeBSD API.
I do not understand this comment. We have a flexibility in kernel, and why
should we hack-patch the wrong behaviour of lower layers instead of fixing
bugs ?

The 'The FreeBSD KPI' there is (or rather, should be):
- return changed resid and set error to 0
- return non-zero error, which causes generic layer to ignore resid changes.
Such KPI provides the maximal freedom for the lower layer, without requiring
filesystem to try to implement impossible things like uio rollback.

diff --git a/sys/kern/sys_pipe.c b/sys/kern/sys_pipe.c
index 338256c..1a61b93 100644
--- a/sys/kern/sys_pipe.c
+++ b/sys/kern/sys_pipe.c
@@ -1286,13 +1286,13 @@ pipe_write(fp, uio, active_cred, flags, td)
 	}
=20
 	/*
-	 * Don't return EPIPE if I/O was successful
+	 * Don't return EPIPE if any byte was written.
+	 * EINTR and other interrupts are handled by generic I/O layer.
+	 * Do not pretend that I/O succeeded for obvious user error
+	 * like EFAULT.
 	 */
-	if ((wpipe->pipe_buffer.cnt =3D=3D 0) &&
-	    (uio->uio_resid =3D=3D 0) &&
-	    (error =3D=3D EPIPE)) {
+	if (uio->uio_resid !=3D orig_resid && error =3D=3D EPIPE)
 		error =3D 0;
-	}
=20
 	if (error =3D=3D 0)
 		vfs_timestamp(&wpipe->pipe_mtime);

--3j1rCUNxv/n/LAQ6
Content-Type: application/pgp-signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (FreeBSD)

iEYEARECAAYFAlAaUEAACgkQC3+MBN1Mb4iF0wCfQXXUtGFHS568+KnTpNyEbO9Q
PjcAnRavk88onkBZ1IFASuWVqYvFoMGg
=ytEu
-----END PGP SIGNATURE-----

--3j1rCUNxv/n/LAQ6--



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20120802100240.GV2676>