From owner-freebsd-arch@FreeBSD.ORG  Thu Aug  2 13:54:47 2012
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52])
	by hub.freebsd.org (Postfix) with ESMTP id E5AB4106566C;
	Thu,  2 Aug 2012 13:54:47 +0000 (UTC)
	(envelope-from brde@optusnet.com.au)
Received: from mail03.syd.optusnet.com.au (mail03.syd.optusnet.com.au
	[211.29.132.184])
	by mx1.freebsd.org (Postfix) with ESMTP id 63EC88FC0C;
	Thu,  2 Aug 2012 13:54:47 +0000 (UTC)
Received: from c122-106-171-246.carlnfd1.nsw.optusnet.com.au
	(c122-106-171-246.carlnfd1.nsw.optusnet.com.au [122.106.171.246])
	by mail03.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id
	q72Dsh61021282
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
	Thu, 2 Aug 2012 23:54:45 +1000
Date: Thu, 2 Aug 2012 23:54:43 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@besplex.bde.org
To: Konstantin Belousov <kostikbel@gmail.com>
In-Reply-To: <20120802100240.GV2676@deviant.kiev.zoral.com.ua>
Message-ID: <20120802222245.D2585@besplex.bde.org>
References: <5018992C.8000207@freebsd.org>
	<20120801071934.GJ2676@deviant.kiev.zoral.com.ua>
	<20120801183240.K1291@besplex.bde.org>
	<20120801162836.GO2676@deviant.kiev.zoral.com.ua>
	<20120802040542.G2978@besplex.bde.org>
	<20120802100240.GV2676@deviant.kiev.zoral.com.ua>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: arch@freebsd.org, David Xu <davidxu@freebsd.org>
Subject: Re: short read/write and error code
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 02 Aug 2012 13:54:48 -0000

Please trime quotes!!

On Thu, 2 Aug 2012, Konstantin Belousov wrote:

> On Thu, Aug 02, 2012 at 04:58:49AM +1000, Bruce Evans wrote:
>> On Wed, 1 Aug 2012, Konstantin Belousov wrote:
>>
>>> On Wed, Aug 01, 2012 at 07:23:09PM +1000, Bruce Evans wrote:
>>>[>] ...
>>>> FreeBSD does try to convert EINTR to 0 after some data has been written,
>>>> to conform to the above.  SIGPIPE should return EINTR to be returned to
>>>> dofilewrite(), so there should be no problem for SIGPIPE.  But we were
>>>> reminded of this old FreeBSD bug by probelms with SIGPIPE.
>>> Sorry, I do not understand this, esp. second sentence.
>>
>> Oops, the second sentence has bad grammar.  I meant "generation of
>> SIGPIPE should cause EINTR to be returned to dofilewrite(), so...".
>> What actually happens is that lower-level code returns EPIPE, without
>> generating EINTR, to dofilewrite() (except socket code generates
>> SIGPIPE directly unless this is disabled by a socket option).  Then
>> dofilewrite() first fails to see the EINTR because the return code
>> is EPIPE (this happens even for the socket case).  Then dofilewrite()
>> generates SIGPIPE, again without changing the error code to EINTR or
>> looping back to pick up the special EINTR handling.  This is actually
>> correct -- the error is specified to be EPIPE, not EINTR.  (This is
>> similar to what happens for the internally-generated SIGXFSZ -- there
>> is a not-so-special error code for that.)
> Um, no, I do not see how EINTR could be referenced in this context at all.
> SUSv4 says quite unambigous that "[EPIPE] An attempt is made to
> write to a pipe or FIFO that is not open for reading by any process, or
> that only has one end open. A SIGPIPE signal shall also be sent to the
> thread."

That's what I said: "the error is specified to be EPIPE, not EINTR".  This
is somewhat surprising behaviour, and breaks FreeBSD's one working case
for short i/o's -- after EINTR -- iff the signal was internally generated.

> Indeed, POSIX does not specify that we cannot return non-zero (success
> write) and deliver SIGPIPE, but this is just insane. If SIGPIPE handler
> returns, the return value from write(2) must be -1 and errno set to
> EPIPE.

We must return a short write with no SIGPIPE, then SIGPIPE and EPIPE for
the next write (without writing anything).

> For naive programs, which are not aware that stdout can be
> pipe (i.e. the original target audience of SIGPIPE) not delivering the
> signal on write(2) which write anything is just fine, since we can
> make a valid assumption that they would repeat the write, and then
> get EPIPE/SIGPIPE as intended.

This is correct for non-naive programs too, but the assumption isn't
valid.  Non-naive programs don't understand short writes and typically
treat them as errors.  Except ones that use stdio -- stdio doesn't
repeat the whole write, but continues from the point that was
successfully written up to.

> Anyway, as I said, I very much dislike making the generic I/O layer
> decide after the filesystem code, and limiting its ability to report
> errors.

And I very much dislike losing data to report errors.

> I believe that all what is needed is the patch below, which just fixes
> a bug in pipe code (obvious after all the discussions). I inspected the
> pipe_read() and it seems to behave correctly without any change.
>>
>> So the above is not completely specified after all.  EINTR is not
>> intended to apply when the signal is internally generated (and you
>> could argue that the write is not interrupted by a signal in that
>> case; instead, the write fails and generates the signal, which is how
>> this is implemented).  Then since it wasn't interrupted, the clause
>> ...
> Exactly, this is what I mentioned to David as an alternative approach for
> the fix, which does not cause screw-up of the generic I/O layer. I just
> went ahead and implemented the fix, see the patch at the end of message.
> It was tested with David program.

It does not fix the generic I/O layer.  Fifos aren't very important, but
they show the problem with the generic layer.

>> It is the FreeBSD API.
> I do not understand this comment. We have a flexibility in kernel, and why
> should we hack-patch the wrong behaviour of lower layers instead of fixing
> bugs ?

The FreeBSD API passes up both the success count and the error code, so
that splitting up i/o's can work through any number of layers.  Then the
top layer can't return both, so it must return the most important thing
-- the success count.  It is not the responsibility of the layer below
dofile* to change its correct behaviour because it knows that it is 1
layer below a buggy layer.  Most layers don't know their level.

> The 'The FreeBSD KPI' there is (or rather, should be):
> - return changed resid and set error to 0

Might even work, using only 1000 times as much code and some runtime
pessimizations to get the same effect as clearing the error in dofile*.
Consider the dofilewrite -> physwrite -> geom_dev -> driver layering.
write() tries to write 1M and gets an error on the last 512-block.
The i/o is split up something like:
- physio:
   8 * 128K.  7 succeed and the last fails on the last 512-block.
   Since something in the last block didn't fail (say just the last
   512- block, though in practice the driver would normally use larger
   blocks and wouldn't know where the error was), and since there is
   no way to back out of any of the writes (the previous contents of
   the successfully overwritten blocks has been lost), lower layers
   should set error to 0 and return the resid changed from 128K down
   to 512 for the last block.  Now we don't see the error and should
   retry the last 512-block.  This time we see the error.  Now since
   there is no way to back out, we should set error to 0 and return
   8*128K - 512 for the count.
- geom_dev:
   No problems for first 7*128K.  Consider only the last 128K.  Assume
   that it is split again into 2*64K.  1 succeeds and the last fails
   on the last 512-block.  Then we see a short count and retry the
   512-block to no avail, as in the physio layer, and eventually
   return a count of 128K-512 with no error.
- driver:
   Most likely it will try to write the whole 64K, with this failing
   and the driver not knowing where the failure was.  But we are assuming
   that it knows that only the last 512-block fails, and tells us this.
   It might retry the 64K-block one 512-block at a time (the old wd driver
   did this).  So eventually it finds that everything except the last
   512-block can be written.  It returns this to the layer above as usual
   block clearing error and returning a count of 64K-512.  All the layers
   above retry the failing block to no avail, as the error for it filters
   up to the top.

The 1000 as times as much code is because you have to do this in all
drivers.  The runtime pessimizations are the useless retries and the
executing the extra code to manage this.  The end result is that the
dofile* layer never sees the error for short i/o's.  Unlike other
layers, this layer doesn't pretend to understand short i/o's, so it
doesn't retry (lower layers could also skip the retry).

> - return non-zero error, which causes generic layer to ignore resid changes.
> Such KPI provides the maximal freedom for the lower layer, without requiring
> filesystem to try to implement impossible things like uio rollback.
>
> diff --git a/sys/kern/sys_pipe.c b/sys/kern/sys_pipe.c
> index 338256c..1a61b93 100644
> --- a/sys/kern/sys_pipe.c
> +++ b/sys/kern/sys_pipe.c
> @@ -1286,13 +1286,13 @@ pipe_write(fp, uio, active_cred, flags, td)
> 	}
>
> 	/*
> -	 * Don't return EPIPE if I/O was successful
> +	 * Don't return EPIPE if any byte was written.
> +	 * EINTR and other interrupts are handled by generic I/O layer.
> +	 * Do not pretend that I/O succeeded for obvious user error
> +	 * like EFAULT.
> 	 */
> -	if ((wpipe->pipe_buffer.cnt == 0) &&
> -	    (uio->uio_resid == 0) &&
> -	    (error == EPIPE)) {
> +	if (uio->uio_resid != orig_resid && error == EPIPE)
> 		error = 0;
> -	}
>
> 	if (error == 0)
> 		vfs_timestamp(&wpipe->pipe_mtime);

This is quite reasonable, but it only touches EPIPE for pipes, leaving
the general case of EANY for anyfiletype broken.

EFAULT is the next easiest error (or even easier) after ENOPSC for
testing bugs in this area, since the user can generate it.  Consider
what happens for an EFAULT on the uio for the last of the 8 128K-blocks
in the above (physio seems to actually use memory mapping, not
uioread/write()).  The first 7 128K-blocks get written irreversibly
and you hopefully get EFAULT when mapping the last block.  This EFAULT
is just as important as EIO, since the underlying file has been changed.

Oops, I just rememebered the justification for _not_ returning short
writes or backing out of writes in the middle of a regular file.  It
is that the failing part may have changed the underlying file, and if
you return a short write for the successful part then the application
will only learn that something fails if it retries the failing part.
A failing write means that the whole extent of the region where the
write was attempted is indeterminate, and this applies equally to
regular files and disk files.  Applications just shouldn't attempt to
write GBs or TBs at a time, since a failure in the middle leaves them
with either no indication of the extent of the error (if a short count
is returned) or with the possibility of the whole extent being changed
to garbage (if an error is returned).  Even in the kernel where bith
the count and the error are returned, there is no way to tell how much
was turned to garbage beyond the ailure point (this depends on details
of the buffering).  Retrying in a careful application will eventually
find the extent of the error though.  The necessary care seems to turn
a simple write() call into at least 50 lines of error handling :-(.

Bruce